kdiagram.datasets.make_classification_data
- kdiagram.datasets.make_classification_data(n_samples=600, n_features=10, n_classes=2, weights=None, class_sep=1.0, flip_y=0.0, informative_frac=0.6, redundant_frac=0.2, seed=42, n_models=2, model_profiles=None, model_names=None, true_col='y', prefix_label='pred_', prefix_proba='proba_', add_compat_cols=False, include_binary_pred_cols=False, as_frame=False)
Generate a synthetic classification dataset with a configurable feature process and multiple model outputs (labels and/or probabilities).
This helper wraps a standard separable feature generator and then synthesizes the outputs of one or more “models” whose behavior can be controlled via model_profiles or via a simple count n_models. It supports binary and multiclass targets, class imbalance, label noise, explicit model names, and convenient, deterministic column naming.
- Parameters:
- n_samples
int, default=600 Number of rows to generate.
- n_features
int, default=10 Total number of feature columns.
- n_classes
int, default=2 Number of classes. Use 2 for binary classification and values greater than 2 for multiclass.
- weights
list of float or None, default=None Class priors that should sum (approximately) to 1. If None, classes are (approximately) balanced.
- class_sep
float, default=1.0 Separation between classes in feature space. Larger values create an easier problem.
- flip_y
float, default=0.0 Fraction of labels to randomly flip as label noise. Must be in [0, 1].
- informative_frac
float, default=0.6 Fraction of features that are informative. Must be in [0, 1] and should satisfy informative_frac + redundant_frac <= 1 [1].
- redundant_frac
float, default=0.2 Fraction of features that are linear combinations of informative features. Must be in [0, 1] and should satisfy informative_frac + redundant_frac <= 1.
- seed
int or None, default=42 Random seed for reproducibility. None uses non-deterministic entropy.
- n_models
int, default=2 Number of model outputs to synthesize. If model_profiles is provided, only the first n_models entries (in insertion order) are used.
- model_profiles
dict or None, default=None Optional per-model configuration. Keys are base model names and values are dicts describing behavior (e.g., logit bias, noise level, calibration skew, thresholding policy). The exact keys supported depend on the implementation. If None, built-in defaults are used.
- model_names
list of str or None, default=None Display names for the first k models, where k = len(model_names). When provided, the probability and (for binary) label columns for those models are named exactly as given (no prefixes). Remaining models (if any) use prefixed, sanitized names. Extra names beyond n_models are ignored with a warning.
- true_col
str, default="y" Column name for the ground-truth labels.
- prefix_label
str, default="pred_" Prefix for auto-named discrete label columns (only used when a user name is not supplied or when multiclass compat columns are requested).
- prefix_proba
str, default="proba_" Prefix for auto-named probability columns (only used when a user name is not supplied).
- add_compat_cols
bool, default=False If True and multiclass, add lightweight compatibility columns that some plotting utilities expect (e.g., yt as an alias of true_col and one yp_<model> column per model with the argmax prediction). Has no effect for pure binary unless the implementation chooses to add aliases.
- include_binary_pred_cols
bool, default=False If True and n_classes == 2, add one discrete label column per model in addition to probabilities. Column names follow the explicit model_names when available, otherwise use f"{prefix_label}_<name>".
- as_frame
bool, default=False If True, return a pandas.DataFrame with tidy columns. Otherwise return a sklearn.utils.Bunch.
- Returns:
pandas.DataFrame or sklearn.utils.Bunch
If as_frame=True: a DataFrame with columns [true_col] + feature_names + proba/label columns. For binary, each model typically contributes a single probability column interpreted as the positive-class probability. For multiclass, each model contributes one probability column per class (e.g., name_0, name_1, ...), plus optional compatibility columns if requested.
If as_frame=False: a Bunch with fields:
- frame: the same DataFrame
- data: ndarray containing model outputs (shape and content depend on configuration)
- feature_names: list of str
- target_names: list of class labels or integers
- target: ndarray of shape (n_samples,)
- model_names: list of display names
- proba_columns: list of probability column labels (if available)
- label_columns: list of discrete label column labels (if available)
- DESCR: short description
- Raises:
ValueError
If class priors are invalid, if fractions are outside [0, 1] or sum to more than 1, if model_names length exceeds n_models in an incompatible way, or if other shape checks fail.
- Return type:
Bunch | DataFrame
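The constraints listed under Raises can be checked before calling the generator. The following is a minimal sketch, not the library's own validation: the helper name check_generation_args, the tolerance value, and the non-negativity check on weights are assumptions made for illustration.

```python
import math


def check_generation_args(weights=None, flip_y=0.0,
                          informative_frac=0.6, redundant_frac=0.2,
                          tol=1e-6):
    """Hypothetical pre-flight check mirroring the documented constraints.

    `tol` is an assumed tolerance for "approximately sums to 1";
    the real function may use a different rule.
    """
    # Each fraction must lie in the closed interval [0, 1].
    for name, value in (("flip_y", flip_y),
                        ("informative_frac", informative_frac),
                        ("redundant_frac", redundant_frac)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Informative and redundant features together cannot exceed all features.
    if informative_frac + redundant_frac > 1.0 + tol:
        raise ValueError("informative_frac + redundant_frac must be <= 1")

    # Class priors, when given, should be non-negative (assumed) and sum to ~1.
    if weights is not None:
        if any(w < 0 for w in weights):
            raise ValueError("weights must be non-negative")
        if not math.isclose(sum(weights), 1.0, abs_tol=tol):
            raise ValueError("weights should sum to 1")


check_generation_args(weights=[0.8, 0.2])      # passes silently
try:
    check_generation_args(informative_frac=0.7, redundant_frac=0.5)
except ValueError as exc:
    print(exc)
```

Running such a check first gives an early, readable error message without depending on how the generator words its own exceptions.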
See also
sklearn.datasets.make_classification: Classic feature generator for classification problems.
sklearn.metrics: Utilities to evaluate classification (e.g., AUC, log-loss, accuracy, F1).
Notes
Dicts preserve insertion order. Model order follows model_profiles keys, or built-in defaults if profiles are not provided.
When model_names is given, those names are used as column labels verbatim for the first k models, allowing clean DataFrames and legends downstream.
Probability column layout differs between binary and multiclass. In binary, one column per model is typical. In multiclass, one column per class per model is common, using class indices 0..n_classes-1 unless the implementation defines another convention [2].
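The probability column layout described in these notes can be sketched as a small helper. This is an illustration of the typical layout only: the function expected_proba_columns is hypothetical, and the actual column labels should be read from the returned frame (or from proba_columns on the Bunch) rather than assumed.

```python
def expected_proba_columns(model_labels, n_classes):
    """Hypothetical helper: list the probability columns one would expect.

    Binary: one positive-class column per model, labelled by the model.
    Multiclass: one column per class per model, suffixed with the class
    index 0..n_classes-1 (the convention described as common above).
    """
    if n_classes == 2:
        return list(model_labels)
    return [f"{label}_{k}" for label in model_labels for k in range(n_classes)]


print(expected_proba_columns(["Good", "Biased"], n_classes=2))
# ['Good', 'Biased']
print(expected_proba_columns(["Good"], n_classes=3))
# ['Good_0', 'Good_1', 'Good_2']
```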
References
[1] Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830, 2011.
Examples
Binary classification with two named models and explicit label columns.
>>> from kdiagram.datasets import make_classification_data
>>> df = make_classification_data(
...     n_samples=400,
...     n_features=8,
...     n_classes=2,
...     n_models=2,
...     model_names=["Good", "Biased"],
...     include_binary_pred_cols=True,
...     as_frame=True,
...     seed=7,
... )
>>> [c for c in df.columns if c.startswith("Good")][:1]
['Good']
Multiclass with three models and compatibility columns.
>>> df = make_classification_data(
...     n_samples=600,
...     n_features=12,
...     n_classes=4,
...     n_models=3,
...     add_compat_cols=True,
...     as_frame=True,
... )
>>> any(c.startswith("yp_") for c in df.columns)
True
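The yp_<model> compatibility columns are described as holding the argmax prediction. As a rough illustration of that rule, the sketch below applies an argmax to a few hand-written probability rows. The rows are made up for the example rather than produced by make_classification_data, and breaking ties in favor of the lowest class index is an assumption.

```python
def argmax_labels(proba_rows):
    """Return the index of the largest probability in each row.

    Ties resolve to the lowest class index (assumed behavior; `max`
    returns the first maximal element it encounters).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in proba_rows]


# Hand-written per-class probabilities for a 4-class problem (illustrative only).
rows = [
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.30, 0.20],
]
labels = argmax_labels(rows)
print(labels)  # [1, 0, 2]
```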