kdiagram.utils.bin_by_feature

kdiagram.utils.bin_by_feature(df, bin_on_col, target_cols, n_bins=10, agg_funcs='mean')

Bins data by a feature and computes aggregate statistics.

This utility groups a DataFrame into bins based on the values in a specified column (bin_on_col), then calculates aggregate statistics (such as mean or std) for one or more target columns within each bin. The same binning logic underlies plots such as plot_error_bands.

Parameters:
df : pd.DataFrame

The input DataFrame.

bin_on_col : str

The name of the column whose values will be used for binning. This column must contain numeric data.

target_cols : str or list of str

The name(s) of the column(s) for which to compute statistics.

n_bins : int, default=10

The number of equal-width bins to create.

agg_funcs : str, list of str, or dict, default='mean'

The aggregation function(s) to apply. Can be anything accepted by pandas' .agg() method (e.g., 'mean', 'std', ['mean', 'std'], or {'col_A': 'sum'}).
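
The three accepted shapes of agg_funcs behave differently in pandas itself, which affects the columns of the result. The sketch below uses a plain pandas groupby rather than bin_by_feature; the frame and the names group, col_A and col_B are made up for illustration, and whether bin_by_feature reshapes or renames the resulting columns is not shown here.

```python
import pandas as pd

# Hypothetical data, used only to show how pandas' .agg() treats each input shape.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "col_A": [1.0, 3.0, 5.0, 7.0],
    "col_B": [10.0, 20.0, 30.0, 40.0],
})
grouped = df.groupby("group")

# A single string: one statistic for every column, flat column names.
single = grouped.agg("mean")

# A list of strings: every statistic for every column, two-level column names.
several = grouped.agg(["mean", "std"])

# A dict: only the listed columns, each with its own statistic.
per_column = grouped.agg({"col_A": "sum"})
```

With plain pandas, the list form yields two-level column labels such as ('col_A', 'mean'), while the dict form keeps only the columns named as keys.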

Returns:
pd.DataFrame

A DataFrame containing the aggregate statistics for each bin.

See also

pandas.cut

The underlying pandas function used for binning.

pandas.DataFrame.groupby

The underlying pandas function for aggregation.

plot_error_bands

A plot that uses this binning logic.

Notes

This function first uses pandas.cut to partition the values in bin_on_col into n_bins discrete, equal-width intervals. It then uses pandas.DataFrame.groupby to group the DataFrame by these new bins and applies the specified aggregation function(s) to the target_cols for each group.
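
The two steps described above can be sketched directly in pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name bin_and_aggregate_sketch is hypothetical, and the choice of observed=False (keeping empty bins in the result) is an assumption of this sketch.

```python
import pandas as pd

def bin_and_aggregate_sketch(df, bin_on_col, target_cols, n_bins=10, agg_funcs="mean"):
    """Hypothetical stand-in illustrating the cut-then-groupby pattern."""
    # Accept a single column name as well as a list of names.
    if isinstance(target_cols, str):
        target_cols = [target_cols]
    # Step 1: partition the binning column into n_bins equal-width intervals.
    bins = pd.cut(df[bin_on_col], bins=n_bins)
    # Step 2: group rows by interval and aggregate the target columns.
    # observed=False keeps empty bins in the result (an assumption of this sketch).
    grouped = df.groupby(bins, observed=False)[target_cols]
    return grouped.agg(agg_funcs)

df = pd.DataFrame({
    "forecast_value": [10, 12, 20, 22, 30, 32],
    "error": [-1, 1.5, -2, 2.5, -3, 3.5],
})
stats = bin_and_aggregate_sketch(
    df, "forecast_value", "error", n_bins=3, agg_funcs=["mean", "std"]
)
```

In this sketch the intervals form the index of the result and the columns are two-level labels such as ('error', 'mean'); the layout returned by bin_by_feature itself may differ.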

Examples

>>> import pandas as pd
>>> from kdiagram.utils.forecast_utils import bin_by_feature
>>>
>>> df = pd.DataFrame({
...     'forecast_value': [10, 12, 20, 22, 30, 32],
...     'error': [-1, 1.5, -2, 2.5, -3, 3.5]
... })
>>>
>>> # Calculate the mean and standard deviation of the error,
>>> # binned by the forecast value.
>>> binned_stats = bin_by_feature(
...     df,
...     bin_on_col='forecast_value',
...     target_cols='error',
...     n_bins=3,
...     agg_funcs=['mean', 'std']
... )
>>> print(binned_stats)
  forecast_value_bin  mean       std
0    (9.978, 17.333]  0.25  1.767767
1   (17.333, 24.667]  0.25  3.181981
2     (24.667, 32.0]  0.25  4.596194