kdiagram.datasets.make_regression_data

kdiagram.datasets.make_regression_data(n_samples=200, n_features=1, feature_range=(0.0, 10.0), n_models=3, model_profiles=None, true_func=None, true_kind='linear', true_coeff_range=(-5.0, 5.0), intercept=5.0, noise_on_true=1.0, heteroskedastic=False, hetero_strength=0.5, prefix='pred_', seed=0, as_frame=False, clip_negative=False, shuffle=True, model_names=None, feature_names=None)

Generate a synthetic regression dataset with a configurable true process and multiple model prediction profiles.

This helper builds features, a noisy ground truth, and one or more model predictions with user-controlled bias and noise. It supports additive, multiplicative, and heteroskedastic error, custom true functions, and deterministic column naming when model_names is provided.

Parameters:

n_samples (int, default=200): Number of rows to generate.
n_features (int, default=1): Number of feature columns.
feature_range (tuple of float, default=(0.0, 10.0)): Closed interval (lo, hi) for uniform feature sampling. Must satisfy hi > lo.
n_models (int, default=3): Number of model prediction columns to create. If model_profiles is given, only the first n_models entries (in insertion order) are used.
model_profiles (dict or None, default=None): Per-model configuration. Keys are base model names and values are dicts with fields: bias (float), noise_std (float), and error_type in {"additive", "multiplicative", "hetero"}. If None, built-in defaults are used.
true_func (callable or None, default=None): Custom function with signature true_func(X: ndarray) -> ndarray of shape (n_samples,). If None, a built-in shape is chosen via true_kind.
true_kind ({"linear", "quadratic", "sine"}, default="linear"): Family of the built-in true process when true_func is None.
true_coeff_range (tuple of float, default=(-5.0, 5.0)): Range used to draw coefficients for built-in shapes.
intercept (float, default=5.0): Intercept term added to the true process.
noise_on_true (float or callable, default=1.0): If float, standard deviation of additive Gaussian noise on the ground truth. If callable, it must accept X and return an array of shape (n_samples,).
heteroskedastic (bool, default=False): If True and noise_on_true is a float, scales the ground-truth noise by a function of the first feature.
hetero_strength (float, default=0.5): Strength parameter used for hetero scaling (both for ground-truth noise when heteroskedastic=True and for error_type="hetero" in model profiles).
prefix (str, default="pred_"): Prefix used for auto-named prediction columns when a user name is not supplied for a model.
seed (int or None, default=0): Seed for the internal random generator. None uses non-deterministic entropy.
as_frame (bool, default=False): If True, return a pandas.DataFrame with tidy columns. Otherwise return a sklearn.utils.Bunch.
clip_negative (bool, default=False): If True, clip the ground truth and predictions at zero.
shuffle (bool, default=True): If True, row-shuffle the output with seed.
model_names (list of str or None, default=None): Explicit display names for the first k models, where k = len(model_names). When provided, the prediction columns for those models are named exactly as given, without prefix. Remaining models (if any) use f"{prefix}{snake_case(base_name)}". Extra names beyond the number of models are ignored with a warning.
feature_names (list of str or None, default=None): Names for feature columns. Must have length equal to n_features. If None, uses ["feature_1", ...].
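
The column-naming rule described for model_names and prefix can be sketched in a few lines. This is an illustration only, not kdiagram's implementation: snake_case and prediction_columns are hypothetical helpers written for this example, and the exact normalization the library applies to base names is an assumption.

```python
import re


def snake_case(name: str) -> str:
    """Hypothetical helper: lower-case and join word runs with underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def prediction_columns(base_names, model_names=None, prefix="pred_"):
    """Illustrative naming rule only; not the library's code."""
    # Names beyond the number of models are dropped here
    # (the documented function additionally emits a warning).
    explicit = list(model_names or [])[: len(base_names)]
    cols = []
    for i, base in enumerate(base_names):
        if i < len(explicit):
            # The first k models take the explicit name verbatim, without prefix.
            cols.append(explicit[i])
        else:
            # Remaining models fall back to prefix + snake-cased base name.
            cols.append(f"{prefix}{snake_case(base)}")
    return cols


cols = prediction_columns(
    ["Good Model", "Biased Model", "Noisy Model"],
    model_names=["Good Model"],
)
print(cols)
```

Under these assumptions, only the first model keeps its display name; the other two receive prefixed, snake-cased labels.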

Returns:

pandas.DataFrame or sklearn.utils.Bunch

If as_frame=True: A DataFrame with columns ["y_true"] + feature_names + prediction_cols.

If as_frame=False: A Bunch with fields:

frame: the same DataFrame.
data: ndarray of shape (n_samples, n_models), containing predictions ordered as in prediction_columns.
feature_names: list of str.
target_names: ["y_true"].
target: ndarray of shape (n_samples,).
model_names: list of display names.
prediction_columns: list of column labels.
prefix: str.
DESCR: short description.

Raises:

ValueError: If feature_range is invalid, if shapes returned by true_func or a noise callable are not (n_samples,), if true_kind is unknown, if a model_profiles entry has an unknown error_type, or if feature_names length mismatches n_features.
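
The shape requirement on true_func and on a callable noise_on_true is easy to violate by returning a 2-D array. The sketch below shows two callables that satisfy it and a small check in the spirit of the ValueError described above. check_shape, ftrue, and noise_fn are hypothetical names used for illustration; the library's own validation may differ, and how the array returned by a noise callable is used (as the noise itself or as a per-row scale) is not stated on this page.

```python
import numpy as np


def ftrue(X):
    # Quadratic in the first feature; indexing X[:, 0] yields shape (n_samples,).
    return 0.5 * X[:, 0] ** 2 + 2.0


def noise_fn(X):
    # Any callable that accepts X and returns shape (n_samples,) fits the contract.
    return 0.1 + 0.2 * np.abs(X[:, 0])


def check_shape(values, n_samples, what):
    """Hypothetical check mirroring the documented ValueError condition."""
    arr = np.asarray(values, dtype=float)
    if arr.shape != (n_samples,):
        raise ValueError(
            f"{what} must return shape ({n_samples},), got {arr.shape}"
        )
    return arr


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 2))

y_clean = check_shape(ftrue(X), 100, "true_func")
noise_out = check_shape(noise_fn(X), 100, "noise_on_true")
```

A callable that returns X itself, or X[:, :1], would fail this check because the result is 2-D.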

Parameters:

n_samples (int)
n_features (int)
feature_range (tuple[float, float])
n_models (int)
model_profiles (dict[str, dict[str, Any]] | None)
true_func (Callable[[ndarray], ndarray] | None)
true_kind (str)
true_coeff_range (tuple[float, float])
intercept (float)
noise_on_true (float | Callable[[ndarray], ndarray])
heteroskedastic (bool)
hetero_strength (float)
prefix (str)
seed (int | None)
as_frame (bool)
clip_negative (bool)
shuffle (bool)
model_names (list[str] | None)
feature_names (list[str] | None)

Return type:

Bunch | DataFrame

See also

sklearn.datasets.make_regression: Classic linear regression toy dataset.
numpy.random.Generator: Modern NumPy RNG used for reproducibility.

Notes

Python dicts preserve insertion order. The order of models is taken from model_profiles keys, or from the built-in defaults when profiles are not supplied.
When model_names is provided, those names are used as the column labels verbatim for the first k models. This allows clean, human-readable headers in a DataFrame and consistent legend labels downstream.
For error_type="multiplicative", prediction noise is applied as a multiplicative factor around 1 [1]. For "hetero", the model’s noise is scaled by a normalized transform of the first feature and hetero_strength [2].
Reproducibility is controlled by seed. Set it to an integer for deterministic output.
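
The three error types can be pictured with a short NumPy sketch. It is one plausible reading of the notes above, not the library's code: apply_error is a hypothetical function, the min-max normalization of the first feature is an assumed stand-in for the unspecified "normalized transform", and adding bias after the multiplicative factor is likewise an assumption.

```python
import numpy as np


def apply_error(y, x0, *, bias, noise_std, error_type, hetero_strength=0.5, rng):
    """Illustrative only: turn a clean signal y into a noisy 'prediction'."""
    eps = rng.normal(0.0, noise_std, size=y.shape)
    if error_type == "additive":
        return y + bias + eps
    if error_type == "multiplicative":
        # Noise acts as a factor around 1 (assumed form).
        return y * (1.0 + eps) + bias
    if error_type == "hetero":
        # Assumed transform: min-max normalize the first feature to [0, 1],
        # then let the noise scale grow with it.
        span = x0.max() - x0.min()
        z = (x0 - x0.min()) / span if span > 0 else np.zeros_like(x0)
        return y + bias + eps * (1.0 + hetero_strength * z)
    raise ValueError(f"unknown error_type: {error_type!r}")


x0 = np.random.default_rng(42).uniform(0.0, 10.0, size=200)
y = 3.0 * x0 + 5.0

# A fresh generator with the same integer seed gives identical draws,
# which is the mechanism behind deterministic output for a fixed seed.
preds = {
    kind: apply_error(
        y, x0, bias=0.0, noise_std=0.1, error_type=kind,
        rng=np.random.default_rng(0),
    )
    for kind in ("additive", "multiplicative", "hetero")
}
```

With the same seed, the additive and hetero variants share the same underlying draws and differ only in how those draws are scaled.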

References

[1] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2009.

[2] Hyndman, Athanasopoulos. Forecasting: Principles and Practice. OTexts, 3rd ed., 2021.

Examples

Create two models with explicit names and return a frame.

>>> from kdiagram.datasets.make import make_regression_data
>>> profiles = {
...     "Good Model": {"bias": 0.0, "noise_std": 5.0,
...                    "error_type": "additive"},
...     "Biased Model": {"bias": -10.0, "noise_std": 2.0,
...                      "error_type": "additive"},
... }
>>> df = make_regression_data(
...     n_samples=200,
...     n_features=1,
...     n_models=2,
...     model_profiles=profiles,
...     model_names=["Good Model", "Biased Model"],
...     as_frame=True,
...     seed=42,
... )
>>> list(df.columns)[:3]
['y_true', 'feature_1', 'Good Model']

Use a custom true function and heteroskedastic noise.

>>> def ftrue(X):
...     return 3.0 * X[:, 0] + 2.0
>>> df = make_regression_data(
...     n_samples=100,
...     true_func=ftrue,
...     noise_on_true=1.5,
...     heteroskedastic=True,
...     as_frame=True,
... )

Return a Bunch for direct array access.

>>> b = make_regression_data(
...     n_samples=50,
...     n_models=3,
...     as_frame=False,
... )
>>> b.data.shape
(50, 3)