kdiagram.plot.comparison.plot_reliability_diagram

kdiagram.plot.comparison.plot_reliability_diagram(y_true, *y_preds, names=None, sample_weight=None, n_bins=10, strategy='uniform', positive_label=1, class_index=None, clip_probs=(0.0, 1.0), normalize_probs=True, error_bars='wilson', conf_level=0.95, show_diagonal=True, diagonal_kwargs=None, show_ece=True, show_brier=True, counts_panel='bottom', counts_norm='fraction', counts_alpha=0.35, figsize=(9, 7), title=None, xlabel='Predicted probability', ylabel='Observed frequency', cmap='tab10', color_palette=None, marker='o', s=40, linewidth=2.0, alpha=0.9, connect=True, legend=True, legend_loc='best', show_grid=True, grid_props=None, xlim=(0.0, 1.0), ylim=(0.0, 1.0), savefig=None, return_data=False, ax=None, **kw)

Plot a reliability diagram (calibration plot) for one or more classification models.

This compares predicted probabilities to observed frequencies across bins of predicted probability. Perfect calibration lies on the diagonal \(y=x\).

Parameters:
y_true : array_like of shape (n_samples,)

Ground truth labels. For binary calibration, values are compared to positive_label after validation and flattening.

*y_preds : array_like(s)

One or more model predictions. Each item may be:

  • 1D array of positive-class probabilities in [0, 1].

  • 2D array of shape (n_samples, n_classes); use class_index to select a column. If class_index is omitted, the last column is used.

names : list of str, optional

Labels for each model curve. If fewer names are provided than models, placeholders like 'Model_1' are appended.

sample_weight : array_like of shape (n_samples,), optional

Per-sample weights used for observed frequencies, ECE, and Brier score. If None, equal weights are used.

n_bins : int, default=10

Number of probability bins.

strategy : {'uniform', 'quantile'}, default='uniform'

Binning strategy.

  • 'uniform': equally spaced edges in [0, 1].

  • 'quantile': edges are empirical quantiles of the pooled predictions. If edges are not unique, the method falls back to uniform binning with a warning.

positive_label : int or float or str, default=1

Label in y_true treated as the positive class when constructing the binary target.

class_index : int, optional

Column index to pick from 2D probability arrays. If omitted, the last column is used.
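As an illustration of how positive_label and class_index reduce the inputs to a binary target and a 1D probability vector, here is a minimal NumPy sketch. It is not the library's internal code; the variable names (proba, y_bin, p_pos, col) and the sample data are made up for the example.

```python
import numpy as np

# Illustrative inputs (not from the library): string labels and a
# two-column probability matrix of shape (n_samples, n_classes).
y_true = np.array(["no", "yes", "yes", "no"])
proba = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.4, 0.6],
                  [0.9, 0.1]])
positive_label = "yes"
class_index = None

# Binary target: 1 where the label equals positive_label, else 0.
y_bin = (y_true == positive_label).astype(int)

# Column selection: fall back to the last column when class_index is None.
col = -1 if class_index is None else class_index
p_pos = proba[:, col]
```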

clip_probs : tuple of (float, float), default=(0.0, 1.0)

Inclusive clipping range applied to predictions. A warning is issued if clipping occurs.

normalize_probs : bool, default=True

If True, attempts to linearly rescale predictions into [0, 1] when minor out-of-range values are detected, then applies clipping.

error_bars : {'wilson', 'normal', 'none'}, default='wilson'

Per-bin uncertainty for observed frequencies.

  • 'wilson': Wilson interval using conf_level.

  • 'normal': normal approximation.

  • 'none': no error bars.

conf_level : float, default=0.95

Confidence level used for error bars when applicable.

show_diagonal : bool, default=True

Draw the reference diagonal \(y=x\).

diagonal_kwargs : dict, optional

Matplotlib keyword arguments for the diagonal reference line (e.g., linestyle, color).

show_ece : bool, default=True

Compute Expected Calibration Error (ECE) and append a summary to each model label.

show_brier : bool, default=True

Compute (weighted) Brier score and append a summary to each model label.

counts_panel : {'none', 'bottom'}, default='bottom'

If not 'none', draw a compact histogram below the main panel that shows per-bin totals for each model.

counts_norm : {'fraction', 'count'}, default='fraction'

Normalization for the counts panel. 'fraction' divides by the total weight; 'count' shows raw weighted sums.

counts_alpha : float, default=0.35

Alpha for bars in the counts panel.

figsize : tuple of (float, float), default=(9, 7)

Figure size for the layout. When counts_panel='bottom', a two-row gridspec is used.

title : str, optional

Title for the plot. If None, no title is set.

xlabel : str, optional

Label for the x-axis. Defaults to 'Predicted probability'.

ylabel : str, optional

Label for the y-axis. Defaults to 'Observed frequency'.

cmap : str, default='tab10'

Matplotlib colormap name used to generate model colors.

color_palette : list, optional

Explicit list of colors. When provided, colors are cycled from this list instead of the colormap.

marker : str, default='o'

Marker used for the bin points.

s : int, default=40

Marker size for the bin points.

linewidth : float, default=2.0

Line width used when connecting bin points.

alpha : float, default=0.9

Alpha for points and lines in the main panel.

connect : bool, default=True

Connect bin points with a line for each model.

legend : bool, default=True

Display a legend. Summary metrics (ECE, Brier) are shown next to model names when enabled.

legend_loc : str, default='best'

Legend location passed to Matplotlib.

show_grid : bool, default=True

Toggle gridlines via the package helper set_axis_grid.

grid_props : dict, optional

Keyword arguments passed to set_axis_grid for grid customization (e.g., linestyle, alpha).

xlim : tuple of (float, float), default=(0.0, 1.0)

X-axis limits.

ylim : tuple of (float, float), default=(0.0, 1.0)

Y-axis limits.

savefig : str, optional

If provided, save the figure to this path; otherwise the plot is shown interactively.

return_data : bool, default=False

If True, return (ax, data_dict) where values are per-model pandas.DataFrame objects with per-bin stats: ['bin_left', 'bin_right', 'bin_center', 'n', 'w_sum', 'p_mean', 'y_rate', 'y_low', 'y_high', 'ece_contrib']. Otherwise, return only the Matplotlib axes.

Returns:
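ax : matplotlib.axes.Axes

Axes of the main calibration plot. When counts_panel='bottom', the second axes (counts panel) is not returned. When return_data=True, the return value is the tuple (ax, data_dict) described under return_data.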

Notes

Calibration compares confidence to accuracy within bins. For bin \(b\), let \(\hat{p}_i\) be predictions and \(y_i\in\{0,1\}\) be binary targets with weights \(w_i\ge 0\). Define the weighted bin mean probability and accuracy as

\[\bar{p}_b \;=\; \frac{\sum_{i\in b} w_i \hat{p}_i} {\sum_{i\in b} w_i}, \qquad \bar{y}_b \;=\; \frac{\sum_{i\in b} w_i y_i} {\sum_{i\in b} w_i}. \tag{1}\]

The Expected Calibration Error (ECE) is

\[\mathrm{ECE} \;=\; \sum_b \left( \frac{\sum_{i\in b} w_i}{\sum_i w_i} \right) \left| \bar{y}_b - \bar{p}_b \right|. \tag{2}\]
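Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration that assumes uniform bins on [0, 1]; the helper name weighted_ece is hypothetical and is not part of kdiagram.

```python
import numpy as np

def weighted_ece(y, p, w=None, n_bins=10):
    """Hypothetical helper: weighted ECE with uniform bins on [0, 1]."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    w = np.ones_like(p) if w is None else np.asarray(w, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin using the interior edges only,
    # so p == 1.0 falls into the last bin.
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        w_b = w[mask].sum()
        if w_b == 0:
            continue  # empty bins contribute nothing
        p_bar = np.sum(w[mask] * p[mask]) / w_b      # Eq. (1), mean probability
        y_bar = np.sum(w[mask] * y[mask]) / w_b      # Eq. (1), observed rate
        ece += (w_b / w.sum()) * abs(y_bar - p_bar)  # Eq. (2)
    return ece

# Under-confident toy model: each bin is off by 0.25.
ece_off = weighted_ece([0, 0, 1, 1], [0.25, 0.25, 0.75, 0.75])
# Calibrated toy model: 3 of 4 positives at predicted probability 0.75.
ece_ok = weighted_ece([0, 1, 1, 1], [0.75, 0.75, 0.75, 0.75])
```

In the first case both occupied bins carry half of the total weight and a gap of 0.25, so the ECE is 0.25; in the second the observed rate equals the mean prediction and the ECE is 0.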

The (weighted) Brier score is

\[\mathrm{Brier} \;=\; \frac{\sum_i w_i \left(\hat{p}_i - y_i\right)^2} {\sum_i w_i}. \tag{3}\]
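The Brier score is a weighted mean of squared errors, which np.average expresses directly. The arrays below are made-up illustration data, and the snippet is a sketch of the formula rather than the library's own code.

```python
import numpy as np

# Made-up binary targets, predicted probabilities, and sample weights.
y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])
w = np.array([1.0, 1.0, 2.0, 2.0])

sq_err = (p - y) ** 2

# Equal weights reduce Eq. (3) to the plain mean squared error.
brier_unweighted = np.mean(sq_err)

# With sample weights, each squared error is scaled by w_i / sum(w).
brier_weighted = np.average(sq_err, weights=w)
```

Here the two poorly predicted samples carry double weight, so the weighted score (0.115) is higher than the unweighted one (0.0925).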

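Wilson confidence intervals for \(\bar{y}_b\) use \(z = \Phi^{-1}\!\left(\tfrac{1+\alpha}{2}\right)\), where \(\alpha\) is conf_level and \(\Phi^{-1}\) is the standard normal quantile function, together with the effective count \(n_b=\sum_{i\in b} w_i\):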

\[\mathrm{center} \;=\; \frac{\bar{y}_b + \frac{z^2}{2 n_b}} {1 + \frac{z^2}{n_b}}, \qquad \mathrm{radius} \;=\; \frac{z}{1 + \frac{z^2}{n_b}} \sqrt{\frac{\bar{y}_b(1-\bar{y}_b)}{n_b} + \frac{z^2}{4 n_b^2}}. \tag{4}\]

The interval is \([\mathrm{center}-\mathrm{radius}, \mathrm{center}+\mathrm{radius}]\), clipped to [0, 1]. The normal interval is instead centered on \(\bar{y}_b\) with half-width \(z\sqrt{\bar{y}_b(1-\bar{y}_b)/n_b}\), that is, \(z\) times the usual standard error.
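The Wilson formula can be checked numerically with the standard library alone. The sketch below assumes conf_level plays the role of \(\alpha\) above and that the normal option is the textbook Wald interval; the function names wilson_interval and normal_interval are hypothetical, not kdiagram API.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(y_rate, n_eff, conf_level=0.95):
    """Hypothetical helper: Wilson interval clipped to [0, 1]."""
    z = NormalDist().inv_cdf((1.0 + conf_level) / 2.0)
    denom = 1.0 + z**2 / n_eff
    center = (y_rate + z**2 / (2.0 * n_eff)) / denom
    radius = (z / denom) * sqrt(
        y_rate * (1.0 - y_rate) / n_eff + z**2 / (4.0 * n_eff**2)
    )
    return max(0.0, center - radius), min(1.0, center + radius)

def normal_interval(y_rate, n_eff, conf_level=0.95):
    """Hypothetical helper: normal-approximation (Wald) interval."""
    z = NormalDist().inv_cdf((1.0 + conf_level) / 2.0)
    se = sqrt(y_rate * (1.0 - y_rate) / n_eff)
    return max(0.0, y_rate - z * se), min(1.0, y_rate + z * se)

# A bin with observed rate 0.5 and effective count 100.
low, high = wilson_interval(0.5, 100)

# A small bin with no positives: the normal interval has zero width,
# while the Wilson interval still has a positive upper bound.
wilson_zero = wilson_interval(0.0, 10)
normal_zero = normal_interval(0.0, 10)
```

The second case shows why a Wilson interval is often preferred for sparse bins: at an observed rate of 0 or 1 the standard error vanishes and the normal interval collapses to a point.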

When strategy='quantile', bin edges are the empirical quantiles of the pooled predictions. If many identical values exist, edges can collapse; in that case, the function falls back to uniform edges with a warning.
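The quantile strategy and its fallback can be sketched as follows. This illustrates the idea rather than the library's implementation; the helper name bin_edges and the warning text are hypothetical.

```python
import warnings
import numpy as np

def bin_edges(probs, n_bins=10, strategy="uniform"):
    """Hypothetical helper: return n_bins + 1 bin edges."""
    uniform = np.linspace(0.0, 1.0, n_bins + 1)
    if strategy == "uniform":
        return uniform
    if strategy != "quantile":
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Empirical quantiles of the pooled predictions.
    edges = np.quantile(probs, np.linspace(0.0, 1.0, n_bins + 1))
    if np.unique(edges).size < edges.size:
        # Repeated edges would create zero-width bins, so fall back.
        warnings.warn("quantile edges are not unique; using uniform bins")
        return uniform
    return edges

rng = np.random.default_rng(0)
spread = rng.random(500)      # continuous values give distinct quantiles
constant = np.full(500, 0.4)  # identical values make the quantiles collapse

quantile_edges = bin_edges(spread, n_bins=5, strategy="quantile")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the fallback warning here
    fallback_edges = bin_edges(constant, n_bins=5, strategy="quantile")
```

With quantile edges each bin holds roughly the same number of samples, which keeps the per-bin intervals comparable in width; the cost is that bins no longer have equal width on the probability axis.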

Examples

Binary example with quantile bins and Wilson intervals.

>>> import numpy as np
>>> from kdiagram.plot.comparison import \
...     plot_reliability_diagram
>>> rng = np.random.default_rng(0)
>>> y = (rng.random(1000) < 0.4).astype(int)
>>> p1 = 0.4 * np.ones_like(y) + 0.15 * rng.random(len(y))
>>> p2 = 0.4 * np.ones_like(y) + 0.05 * rng.random(len(y))
>>> ax = plot_reliability_diagram(
...     y, p1, p2,
...     names=['Wide', 'Tight'],
...     n_bins=12,
...     strategy='quantile',
...     error_bars='wilson',
...     counts_panel='bottom',
...     show_ece=True,
...     show_brier=True,
...     title=('Reliability Diagram '
...            '(Quantile bins + Wilson CIs)'),
... )