kdiagram.datasets.make_classification_data

kdiagram.datasets.make_classification_data(n_samples=600, n_features=10, n_classes=2, weights=None, class_sep=1.0, flip_y=0.0, informative_frac=0.6, redundant_frac=0.2, seed=42, n_models=2, model_profiles=None, model_names=None, true_col='y', prefix_label='pred_', prefix_proba='proba_', add_compat_cols=False, include_binary_pred_cols=False, as_frame=False)

Generate a synthetic classification dataset with a configurable feature process and multiple model outputs (labels and/or probabilities).

This helper wraps a standard separable feature generator and then synthesizes the outputs of one or more “models” whose behavior can be controlled via model_profiles or via a simple count n_models. It supports binary and multiclass targets, class imbalance, label noise, explicit model names, and convenient, deterministic column naming.

Parameters:
n_samples : int, default=600

Number of rows to generate.

n_features : int, default=10

Total number of feature columns.

n_classes : int, default=2

Number of classes. Use 2 for binary classification and values greater than 2 for multiclass.

weights : list of float or None, default=None

Class priors that should sum (approximately) to 1. If None, classes are (approximately) balanced.

class_sep : float, default=1.0

Separation between classes in feature space. Larger values create an easier problem.

flip_y : float, default=0.0

Fraction of labels to randomly flip as label noise. Must be in [0, 1].

informative_frac : float, default=0.6

Fraction of features that are informative. Must be in [0, 1] and should satisfy informative_frac + redundant_frac <= 1 [1].

redundant_frac : float, default=0.2

Fraction of features that are linear combinations of informative features. Must be in [0, 1] and should satisfy informative_frac + redundant_frac <= 1.

seed : int or None, default=42

Random seed for reproducibility. If None, the generator is not seeded with a fixed value, so results differ between runs.

n_models : int, default=2

Number of model outputs to synthesize. If model_profiles is provided, only the first n_models entries (in insertion order) are used.

model_profiles : dict or None, default=None

Optional per-model configuration. Keys are base model names and values are dicts describing behavior (e.g., logit bias, noise level, calibration skew, thresholding policy). The exact keys supported depend on the implementation. If None, built-in defaults are used.

model_names : list of str or None, default=None

Display names for the first k models, where k = len(model_names). When provided, the probability and (for binary) label columns for those models are named exactly as given (no prefixes). Remaining models (if any) use prefixed, sanitized names. Extra names beyond n_models are ignored with a warning.

true_col : str, default="y"

Column name for the ground-truth labels.

prefix_label : str, default="pred_"

Prefix for auto-named discrete label columns (only used when a user name is not supplied or when multiclass compat columns are requested).

prefix_proba : str, default="proba_"

Prefix for auto-named probability columns (only used when a user name is not supplied).

add_compat_cols : bool, default=False

If True and multiclass, add lightweight compatibility columns that some plotting utilities expect (e.g., yt as an alias of true_col and one yp_<model> column per model with the argmax prediction). Has no effect for pure binary unless the implementation chooses to add aliases.

include_binary_pred_cols : bool, default=False

If True and n_classes == 2, add one discrete label column per model in addition to probabilities. Column names follow the explicit model_names when available, otherwise use f"{prefix_label}_<name>".

as_frame : bool, default=False

If True, return a pandas.DataFrame with tidy columns. Otherwise return a sklearn.utils.Bunch.
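
The constraint informative_frac + redundant_frac <= 1 described above can be illustrated with a small sketch that turns the two fractions into integer feature counts. The helper name split_feature_counts and the rounding rule are assumptions made for illustration only; the library may convert fractions to counts differently.

```python
def split_feature_counts(n_features, informative_frac=0.6, redundant_frac=0.2):
    """Hypothetical helper: turn feature fractions into integer counts."""
    for name, frac in (("informative_frac", informative_frac),
                       ("redundant_frac", redundant_frac)):
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {frac}")
    # Small tolerance so that sums such as 0.7 + 0.3 are not rejected
    # because of floating-point rounding.
    if informative_frac + redundant_frac > 1.0 + 1e-9:
        raise ValueError("informative_frac + redundant_frac must be <= 1")

    # Assumed rule: round to nearest and keep at least one informative feature.
    n_informative = max(1, int(round(n_features * informative_frac)))
    # Redundant features can only use what is left after the informative ones.
    n_redundant = min(int(round(n_features * redundant_frac)),
                      n_features - n_informative)
    # Whatever remains is treated as pure-noise features.
    n_noise = n_features - n_informative - n_redundant
    return n_informative, n_redundant, n_noise


counts = split_feature_counts(10)
print(counts)
```

With the documented defaults and n_features=10, this sketch yields six informative, two redundant, and two noise features.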

Returns:
pandas.DataFrame or sklearn.utils.Bunch

If as_frame=True:

A DataFrame with columns:

[true_col] + feature_names + proba/label columns. For binary, each model typically contributes a single probability column interpreted as the positive-class probability. For multiclass, each model contributes one probability column per class (e.g., name_0, name_1, ...), plus optional compatibility columns if requested.

If as_frame=False:

A Bunch with fields:

  • frame : the same DataFrame.

  • data : ndarray containing model outputs (shape and content depend on configuration).

  • feature_names : list of str.

  • target_names : list of class labels or integers.

  • target : ndarray of shape (n_samples,).

  • model_names : list of display names.

  • proba_columns : list of probability column labels (if available).

  • label_columns : list of discrete label column labels (if available).

  • DESCR : short description.

Raises:
ValueError

If class priors are invalid, if fractions are outside [0, 1] or sum to more than 1, if model_names length exceeds n_models in an incompatible way, or if other shape checks fail.
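
As an illustration of the class-prior check mentioned above, the following sketch validates a weights list. The function name check_class_weights, the tolerance, and the requirement of exactly one weight per class are assumptions, not the library's actual validation code.

```python
import math


def check_class_weights(weights, n_classes, tol=1e-6):
    """Hypothetical validator for the `weights` argument."""
    if weights is None:
        # No priors given: fall back to balanced classes.
        return [1.0 / n_classes] * n_classes
    if len(weights) != n_classes:
        # Assumption: one prior per class is required.
        raise ValueError(
            f"expected {n_classes} weights, got {len(weights)}"
        )
    if any(w < 0 for w in weights):
        raise ValueError("class priors must be non-negative")
    total = math.fsum(weights)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"class priors must sum to 1, got {total}")
    # Renormalize to remove small rounding drift.
    return [w / total for w in weights]


print(check_class_weights([0.9, 0.1], n_classes=2))
```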

Return type:

Bunch | DataFrame

See also

sklearn.datasets.make_classification

Classic feature generator for classification problems.

sklearn.metrics

Utilities to evaluate classification (e.g., AUC, log-loss, accuracy, F1).

Notes

  • Dicts preserve insertion order. Model order follows model_profiles keys, or built-in defaults if profiles are not provided.

  • When model_names is given, those names are used as column labels verbatim for the first k models, allowing clean DataFrames and legends downstream.

  • Probability column layout differs between binary and multiclass. In binary, one column per model is typical. In multiclass, one column per class per model is common, using class indices 0..n_classes-1 unless the implementation defines another convention [2].
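
The binary versus multiclass layout described in the last note can be sketched in a few lines. The helper names proba_column_names and argmax_label are hypothetical, the model name "m1" is a placeholder, and the <name>_<class index> pattern simply follows the name_0, name_1, ... example given under Returns.

```python
def proba_column_names(model_name, n_classes):
    """Hypothetical helper: probability column labels for one model."""
    if n_classes == 2:
        # Binary: a single positive-class probability column.
        return [model_name]
    # Multiclass: one column per class index 0..n_classes-1.
    return [f"{model_name}_{k}" for k in range(n_classes)]


def argmax_label(row_probs):
    """Index of the largest probability (first index wins on ties)."""
    return max(range(len(row_probs)), key=row_probs.__getitem__)


cols = proba_column_names("m1", 4)
row = dict(zip(cols, [0.1, 0.6, 0.2, 0.1]))
pred = argmax_label([row[c] for c in cols])
print(cols, pred)
```

An argmax over the per-class columns of one model is the kind of discrete prediction that the yp_<model> compatibility columns are described as holding.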

References

[1]

Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.

[2]

Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830, 2011.

Examples

Binary classification with two named models and explicit label columns.

>>> from kdiagram.datasets import make_classification_data
>>> df = make_classification_data(
...     n_samples=400,
...     n_features=8,
...     n_classes=2,
...     n_models=2,
...     model_names=["Good", "Biased"],
...     include_binary_pred_cols=True,
...     as_frame=True,
...     seed=7,
... )
>>> [c for c in df.columns if c.startswith("Good")][:1]
['Good']

Multiclass with three models and compatibility columns.

>>> df = make_classification_data(
...     n_samples=600,
...     n_features=12,
...     n_classes=4,
...     n_models=3,
...     add_compat_cols=True,
...     as_frame=True,
... )
>>> any(c.startswith("yp_") for c in df.columns)
True
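
The naming rule described for model_names (explicit names used verbatim for the first k models, prefixed and sanitized names for the rest, extra names ignored with a warning) can be sketched as follows. The function resolve_proba_names, the base names, and the sanitization regex are assumptions for illustration; the library's own sanitization may differ.

```python
import re
import warnings


def resolve_proba_names(base_names, model_names=None, prefix_proba="proba_"):
    """Hypothetical helper: pick one probability column label per model."""
    model_names = list(model_names or [])
    if len(model_names) > len(base_names):
        # Extra display names have no model to attach to.
        warnings.warn("Extra model names are ignored.")
        model_names = model_names[:len(base_names)]

    resolved = []
    for i, base in enumerate(base_names):
        if i < len(model_names):
            # User-supplied names are used verbatim, without a prefix.
            resolved.append(model_names[i])
        else:
            # Assumed sanitization: collapse non-word characters to "_".
            safe = re.sub(r"\W+", "_", base).strip("_")
            resolved.append(f"{prefix_proba}{safe}")
    return resolved


print(resolve_proba_names(["good model", "biased"], model_names=["Good"]))
```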