kdiagram.datasets.make_regression_data
- kdiagram.datasets.make_regression_data(n_samples=200, n_features=1, feature_range=(0.0, 10.0), n_models=3, model_profiles=None, true_func=None, true_kind='linear', true_coeff_range=(-5.0, 5.0), intercept=5.0, noise_on_true=1.0, heteroskedastic=False, hetero_strength=0.5, prefix='pred_', seed=0, as_frame=False, clip_negative=False, shuffle=True, model_names=None, feature_names=None)
Generate a synthetic regression dataset with a configurable true process and multiple model prediction profiles.
This helper builds features, a noisy ground truth, and one or more model predictions with user-controlled bias and noise. It supports additive, multiplicative, and heteroskedastic error, custom true functions, and deterministic column naming when model_names is provided.
- Parameters:
- n_samples
int, default=200 Number of rows to generate.
- n_features
int, default=1 Number of feature columns.
- feature_range
tuple of float, default=(0.0, 10.0) Closed interval for uniform feature sampling. Must satisfy hi > lo.
- n_models
int, default=3 Number of model prediction columns to create. If model_profiles is given, only the first n_models entries (in insertion order) are used.
- model_profiles
dict or None, default=None Per-model configuration. Keys are base model names and values are dicts with fields: bias (float), noise_std (float), and error_type in {"additive", "multiplicative", "hetero"}. If None, built-in defaults are used.
- true_func
callable or None, default=None Custom function with signature true_func(X: ndarray) -> ndarray of shape (n_samples,). If None, a built-in shape is chosen via true_kind.
- true_kind
{"linear", "quadratic", "sine"}, default="linear" Family of the built-in true process when true_func is None.
- true_coeff_range
tuple of float, default=(-5.0, 5.0) Range used to draw coefficients for built-in shapes.
- intercept
float, default=5.0 Intercept term added to the true process.
- noise_on_true
float or callable, default=1.0 If float, standard deviation of additive Gaussian noise on the ground truth. If callable, it must accept X and return an array of shape (n_samples,).
- heteroskedastic
bool, default=False If True and noise_on_true is a float, scales the ground-truth noise by a function of the first feature.
- hetero_strength
float, default=0.5 Strength parameter used for hetero scaling (both for ground-truth noise when heteroskedastic=True and for error_type="hetero" in model profiles).
- prefix
str, default="pred_" Prefix used for auto-named prediction columns when a user name is not supplied for a model.
- seed
int or None, default=0 Seed for the internal random generator. None uses non-deterministic entropy.
- as_frame
bool, default=False If True, return a pandas.DataFrame with tidy columns. Otherwise return a sklearn.utils.Bunch.
- clip_negative
bool, default=False If True, clip the ground truth and predictions at zero.
- shuffle
bool, default=True If True, shuffle the output rows using seed.
- model_names
list of str or None, default=None Explicit display names for the first k models, where k = len(model_names). When provided, the prediction columns for those models are named exactly as given, without prefix. Remaining models (if any) use f"{prefix}{snake_case(base_name)}". Extra names beyond the number of models are ignored with a warning.
- feature_names
list of str or None, default=None Names for feature columns. Must have length equal to n_features. If None, uses ["feature_1", ...].
- Returns:
pandas.DataFrame or sklearn.utils.Bunch
  - If as_frame=True: a DataFrame with columns ["y_true"] + feature_names + prediction_cols.
  - If as_frame=False: a Bunch with fields:
    - frame: the same DataFrame,
    - data: ndarray of shape (n_samples, n_models), containing predictions ordered as in prediction_columns,
    - feature_names: list of str,
    - target_names: ["y_true"],
    - target: ndarray of shape (n_samples,),
    - model_names: list of display names,
    - prediction_columns: list of column labels,
    - prefix: str,
    - DESCR: short description.
- Raises:
ValueError If feature_range is invalid, if true_func or a noise callable returns an array whose shape is not (n_samples,), if true_kind is unknown, if a model_profiles entry has an unknown error_type, or if the length of feature_names does not match n_features.
- Return type:
Bunch | DataFrame
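The column naming rule described for model_names and prefix can be sketched as below. This is only an illustration of the documented behaviour, not the library's code: snake_case and resolve_prediction_columns are hypothetical helpers, and the exact normalisation applied by the real snake_case step is an assumption.

```python
import re
import warnings


def snake_case(name):
    # Hypothetical helper: lowercase and join alphanumeric runs with "_".
    return "_".join(re.findall(r"[A-Za-z0-9]+", name)).lower()


def resolve_prediction_columns(base_names, model_names=None, prefix="pred_"):
    """Return one column label per model, following the documented rule."""
    names = list(model_names or [])
    if len(names) > len(base_names):
        # Extra names beyond the number of models are dropped with a warning.
        warnings.warn("Extra model_names are ignored.")
        names = names[: len(base_names)]
    # The first k models take the user-supplied labels verbatim.
    columns = list(names)
    # Remaining models fall back to prefix + snake_case(base name).
    for base in base_names[len(names):]:
        columns.append(f"{prefix}{snake_case(base)}")
    return columns


cols = resolve_prediction_columns(
    ["Good Model", "Biased Model", "Noisy Model"],  # base names (profile keys)
    model_names=["Good Model"],                      # only the first model is named
)
print(cols)
```

With one explicit name and three models, the first label is used as given and the other two are derived from their base names.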
See also
sklearn.datasets.make_regression : Classic linear regression toy dataset.
numpy.random.Generator : Modern NumPy RNG used for reproducibility.
Notes
Python dicts preserve insertion order. The order of models is taken from model_profiles keys, or from the built-in defaults when profiles are not supplied.
When model_names is provided, those names are used as the column labels verbatim for the first k models. This allows clean, human-readable headers in a DataFrame and consistent legend labels downstream.
For error_type="multiplicative", prediction noise is applied as a multiplicative factor around 1 [1]. For "hetero", the model's noise is scaled by a normalized transform of the first feature and hetero_strength [2].
Reproducibility is controlled by seed. Set it to an integer for deterministic output.
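The three error types mentioned in these notes can be illustrated with a small NumPy sketch. The formulas are assumptions chosen to match the wording here (a factor around 1 for the multiplicative case, a min-max normalised first feature for the hetero case). The exact expressions used inside make_regression_data are not given on this page and may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000
X = rng.uniform(0.0, 10.0, size=(n_samples, 1))
y_true = 3.0 * X[:, 0] + 5.0  # stand-in ground truth, noise-free for clarity

bias, noise_std, hetero_strength = 0.5, 0.1, 0.5

# Additive: constant bias plus Gaussian noise on the original scale.
pred_additive = y_true + bias + rng.normal(0.0, noise_std, n_samples)

# Multiplicative (assumed form): scale by a random factor centred on 1.
factor = 1.0 + rng.normal(0.0, noise_std, n_samples)
pred_multiplicative = (y_true + bias) * factor

# Hetero (assumed form): the noise scale grows with the first feature,
# normalised to [0, 1] and weighted by hetero_strength.
x0 = X[:, 0]
x_norm = (x0 - x0.min()) / (x0.max() - x0.min())
scale = noise_std * (1.0 + hetero_strength * x_norm)
pred_hetero = y_true + bias + rng.normal(0.0, scale)
```

In this sketch the hetero noise scale ranges from noise_std at the smallest feature value to noise_std * (1 + hetero_strength) at the largest.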
References
[1] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2009.
[2] Hyndman, Athanasopoulos. Forecasting: Principles and Practice. OTexts, 3rd ed., 2021.
Examples
Create two models with explicit names and return a frame.
>>> from kdiagram.datasets.make import make_regression_data
>>> profiles = {
...     "Good Model": {"bias": 0.0, "noise_std": 5.0,
...                    "error_type": "additive"},
...     "Biased Model": {"bias": -10.0, "noise_std": 2.0,
...                      "error_type": "additive"},
... }
>>> df = make_regression_data(
...     n_samples=200,
...     n_features=1,
...     n_models=2,
...     model_profiles=profiles,
...     model_names=["Good Model", "Biased Model"],
...     as_frame=True,
...     seed=42,
... )
>>> list(df.columns)[:3]
['y_true', 'feature_1', 'Good Model']
Use a custom true function and heteroskedastic noise.
>>> def ftrue(X):
...     return 3.0 * X[:, 0] + 2.0
>>> df = make_regression_data(
...     n_samples=100,
...     true_func=ftrue,
...     noise_on_true=1.5,
...     heteroskedastic=True,
...     as_frame=True,
... )
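A custom true_func must return an array of shape (n_samples,), and the Raises section lists a ValueError for anything else. The sketch below shows a common way to get this wrong and a check that mirrors the documented condition. check_shape is a hypothetical helper written for this illustration, not part of kdiagram.

```python
import numpy as np


def check_shape(values, n_samples, name):
    # Hypothetical validator mirroring the documented ValueError condition.
    values = np.asarray(values, dtype=float)
    if values.shape != (n_samples,):
        raise ValueError(
            f"{name} must return shape ({n_samples},), got {values.shape}"
        )
    return values


def ftrue(X):
    # Correct: indexing the first column yields a 1-D array.
    return 3.0 * X[:, 0] + 2.0


def ftrue_bad(X):
    # Incorrect: keeps the feature axis, so the result is (n_samples, 1).
    return 3.0 * X + 2.0


X = np.linspace(0.0, 10.0, 5).reshape(-1, 1)
y = check_shape(ftrue(X), X.shape[0], "true_func")

try:
    check_shape(ftrue_bad(X), X.shape[0], "true_func")
    rejected = False
except ValueError:
    rejected = True
print(y.shape, rejected)
```

The same shape contract applies to a callable passed as noise_on_true. This page does not state what the returned array represents beyond its shape, so the sketch checks shapes only.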
Return a Bunch for direct array access.
>>> b = make_regression_data(
...     n_samples=50,
...     n_models=3,
...     as_frame=False,
... )
>>> b.data.shape
(50, 3)
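Because data has shape (n_samples, n_models) and target has shape (n_samples,), per-model error metrics reduce to one broadcasted expression. The sketch below uses stand-in arrays with the documented shapes rather than a real Bunch, so it runs without kdiagram. The column labels and bias values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models = 50, 3

# Stand-ins for b.target and b.data with the documented shapes.
target = rng.uniform(0.0, 10.0, n_samples)           # (n_samples,)
biases = np.array([0.0, 2.0, -1.0])                  # one constant bias per model
data = target[:, None] + biases                      # (n_samples, n_models)
prediction_columns = ["pred_a", "pred_b", "pred_c"]  # hypothetical labels

# Broadcast target as a column against data to get per-model errors.
errors = data - target[:, None]
rmse = np.sqrt(np.mean(errors**2, axis=0))           # (n_models,)
rmse_by_model = dict(zip(prediction_columns, rmse))
print(rmse_by_model)
```

With a real Bunch, b.data, b.target and b.prediction_columns would take the place of the stand-ins, since data is documented as ordered like prediction_columns.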