kdiagram.datasets.make_regression_data

kdiagram.datasets.make_regression_data(n_samples=200, n_features=1, feature_range=(0.0, 10.0), n_models=3, model_profiles=None, true_func=None, true_kind='linear', true_coeff_range=(-5.0, 5.0), intercept=5.0, noise_on_true=1.0, heteroskedastic=False, hetero_strength=0.5, prefix='pred_', seed=0, as_frame=False, clip_negative=False, shuffle=True, model_names=None, feature_names=None)

Generate a synthetic regression dataset with a configurable true process and multiple model prediction profiles.

This helper builds features, a noisy ground truth, and one or more model predictions with user-controlled bias and noise. It supports additive, multiplicative, and heteroskedastic error, custom true functions, and deterministic column naming when model_names is provided.

Parameters:
n_samples : int, default=200

Number of rows to generate.

n_features : int, default=1

Number of feature columns.

feature_range : tuple of float, default=(0.0, 10.0)

Bounds (lo, hi) of the closed interval from which features are sampled uniformly. Must satisfy hi > lo.

n_models : int, default=3

Number of model prediction columns to create. If model_profiles is given, only the first n_models entries (in insertion order) are used.

model_profiles : dict or None, default=None

Per-model configuration. Keys are base model names and values are dicts with fields: bias (float), noise_std (float), and error_type, one of {"additive", "multiplicative", "hetero"}. If None, built-in defaults are used.

true_func : callable or None, default=None

Custom function with signature true_func(X: ndarray) -> ndarray that returns an array of shape (n_samples,). If None, a built-in shape is chosen via true_kind.

true_kind : {"linear", "quadratic", "sine"}, default="linear"

Family of the built-in true process when true_func is None.

true_coeff_range : tuple of float, default=(-5.0, 5.0)

Range used to draw coefficients for built-in shapes.

intercept : float, default=5.0

Intercept term added to the true process.

noise_on_true : float or callable, default=1.0

If float, standard deviation of additive Gaussian noise on the ground truth. If callable, it must accept X and return an array of shape (n_samples,).

heteroskedastic : bool, default=False

If True and noise_on_true is a float, scales the ground-truth noise by a function of the first feature.

hetero_strength : float, default=0.5

Strength parameter used for hetero scaling (both for ground-truth noise when heteroskedastic=True and for error_type="hetero" in model profiles).

prefix : str, default="pred_"

Prefix used for auto-named prediction columns when a user name is not supplied for a model.

seed : int or None, default=0

Seed for the internal random generator. If None, the generator is seeded from fresh entropy and the output is not reproducible.

as_frame : bool, default=False

If True, return a pandas.DataFrame with tidy columns. Otherwise return a sklearn.utils.Bunch.

clip_negative : bool, default=False

If True, clip the ground truth and predictions at zero.

shuffle : bool, default=True

If True, shuffle the rows of the output using the random generator seeded with seed.

model_names : list of str or None, default=None

Explicit display names for the first k models, where k = len(model_names). When provided, the prediction columns for those models are named exactly as given, without prefix. Remaining models (if any) use f"{prefix}{snake_case(base_name)}". Extra names beyond the number of models are ignored with a warning.

feature_names : list of str or None, default=None

Names for feature columns. Must have length equal to n_features. If None, uses ["feature_1", ...].
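The column-naming rule described under model_names and prefix can be sketched in plain Python. This is an illustration only, not the library's code: snake_case and resolve_columns are hypothetical helpers, and the exact snake-casing the library applies is an assumption.

```python
import re
import warnings


def snake_case(name):
    """Hypothetical helper: join alphanumeric runs with '_' and lower-case."""
    return "_".join(re.findall(r"[A-Za-z0-9]+", name)).lower()


def resolve_columns(base_names, model_names=None, prefix="pred_"):
    """Sketch of the documented rule: explicit names first, prefix for the rest."""
    model_names = list(model_names or [])
    if len(model_names) > len(base_names):
        # Extra names beyond the number of models are dropped with a warning.
        warnings.warn("Extra model_names are ignored.")
        model_names = model_names[: len(base_names)]
    k = len(model_names)
    # Models without an explicit name fall back to prefix + snake_case(base).
    auto = [f"{prefix}{snake_case(base)}" for base in base_names[k:]]
    return model_names + auto


cols = resolve_columns(
    ["Good Model", "Biased Model", "Noisy Model"],
    model_names=["Baseline"],
)
print(cols)  # ['Baseline', 'pred_biased_model', 'pred_noisy_model']
```

Only the first model receives its explicit name verbatim; the remaining two fall back to the prefixed form.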

Returns:
pandas.DataFrame or sklearn.utils.Bunch

If as_frame=True:

A DataFrame with columns ["y_true"] + feature_names + prediction_cols.

If as_frame=False:

A Bunch with fields:

  • frame : the same DataFrame.

  • data : ndarray of shape (n_samples, n_models), containing predictions ordered as in prediction_columns.

  • feature_names : list of str.

  • target_names : ["y_true"].

  • target : ndarray of shape (n_samples,).

  • model_names : list of display names.

  • prediction_columns : list of column labels.

  • prefix : str.

  • DESCR : short description.

Raises:
ValueError

If feature_range is invalid, if shapes returned by true_func or a noise callable are not (n_samples,), if true_kind is unknown, if a model_profiles entry has an unknown error_type, or if feature_names length mismatches n_features.

Return type:

Bunch | DataFrame

See also

sklearn.datasets.make_regression

Classic linear regression toy dataset.

numpy.random.Generator

Modern NumPy RNG used for reproducibility.

Notes

  • Python dicts preserve insertion order. The order of models is taken from model_profiles keys, or from the built-in defaults when profiles are not supplied.

  • When model_names is provided, those names are used as the column labels verbatim for the first k models. This allows clean, human-readable headers in a DataFrame and consistent legend labels downstream.

  • For error_type="multiplicative", prediction noise is applied as a multiplicative factor around 1 [1]. For "hetero", the model’s noise is scaled by a normalized transform of the first feature and hetero_strength [2].

  • Reproducibility is controlled by seed. Set it to an integer for deterministic output.
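The three error types mentioned in the notes can be illustrated with a small NumPy sketch. The exact formulas used by the library are not given here, so everything below is an assumption: the multiplicative factor is taken as 1 + eps, and the hetero scale as 1 + hetero_strength * x_norm, with x_norm the min-max normalized first feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0.0, 10.0, size=n)   # first feature
y_true = 3.0 * x + 5.0               # noiseless truth, for clarity
bias, noise_std, hetero_strength = 0.5, 0.2, 0.5

# Additive: constant bias plus Gaussian noise with a fixed spread.
pred_add = y_true + bias + rng.normal(0.0, noise_std, size=n)

# Multiplicative (assumed form): noise enters as a factor centred on 1.
pred_mul = (y_true + bias) * (1.0 + rng.normal(0.0, noise_std, size=n))

# Hetero (assumed form): noise spread grows with the normalized first feature.
x_norm = (x - x.min()) / (x.max() - x.min())
scale = 1.0 + hetero_strength * x_norm
pred_het = y_true + bias + rng.normal(0.0, noise_std, size=n) * scale
```

Under these assumptions, the additive residuals have a roughly constant spread, while the hetero residuals widen from a factor of 1 at the smallest feature value to 1 + hetero_strength at the largest.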

References

[1]

Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2009.

[2]

Hyndman, Athanasopoulos. Forecasting: Principles and Practice. OTexts, 3rd ed., 2021.

Examples

Create two models with explicit names and return a frame.

>>> from kdiagram.datasets.make import make_regression_data
>>> profiles = {
...     "Good Model": {"bias": 0.0, "noise_std": 5.0,
...                    "error_type": "additive"},
...     "Biased Model": {"bias": -10.0, "noise_std": 2.0,
...                      "error_type": "additive"},
... }
>>> df = make_regression_data(
...     n_samples=200,
...     n_features=1,
...     n_models=2,
...     model_profiles=profiles,
...     model_names=["Good Model", "Biased Model"],
...     as_frame=True,
...     seed=42,
... )
>>> list(df.columns)[:3]
['y_true', 'feature_1', 'Good Model']

Use a custom true function and heteroskedastic noise.

>>> def ftrue(X):
...     return 3.0 * X[:, 0] + 2.0
>>> df = make_regression_data(
...     n_samples=100,
...     true_func=ftrue,
...     noise_on_true=1.5,
...     heteroskedastic=True,
...     as_frame=True,
... )
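A custom true_func must return an array of shape (n_samples,), and a column vector of shape (n_samples, 1) is a common way to break that contract. The sketch below shows a two-feature true function together with an illustrative shape check; ftrue_multi and check_shape are hypothetical names, and the check only mirrors the documented ValueError condition rather than reproducing the library's validation.

```python
import numpy as np


def ftrue_multi(X):
    """Hypothetical true process over two features; returns shape (n_samples,)."""
    return 2.0 * X[:, 0] - 0.5 * X[:, 1] ** 2 + 1.0


def check_shape(values, n_samples):
    """Illustrative check mirroring the documented (n_samples,) contract."""
    values = np.asarray(values)
    if values.shape != (n_samples,):
        raise ValueError(f"expected shape ({n_samples},), got {values.shape}")
    return values


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 2))
y = check_shape(ftrue_multi(X), n_samples=100)

# A column vector of shape (n_samples, 1) violates the contract.
try:
    check_shape(ftrue_multi(X).reshape(-1, 1), n_samples=100)
except ValueError as exc:
    print(exc)  # expected shape (100,), got (100, 1)
```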

Return a Bunch for direct array access.

>>> b = make_regression_data(
...     n_samples=50,
...     n_models=3,
...     as_frame=False,
... )
>>> b.data.shape
(50, 3)
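Because data holds one column per model and target holds the ground truth, per-model error metrics reduce to a single vectorized expression. The sketch below uses stand-in arrays with the documented shapes instead of calling the library; with a real Bunch, b.data and b.target would take their place and b.prediction_columns would label the results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models = 50, 3

# Stand-ins with the documented shapes of b.target and b.data.
target = rng.normal(size=n_samples)
data = target[:, None] + rng.normal(scale=0.5, size=(n_samples, n_models))

# Root-mean-square error per model: one value per prediction column.
rmse = np.sqrt(np.mean((data - target[:, None]) ** 2, axis=0))
print(rmse.shape)  # (3,)
```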