Model Comparison Gallery

This gallery page showcases plots from k-diagram designed for comparing the performance of multiple models across various metrics, primarily using radar charts.

Note

You need to run the code snippets locally to generate the plot images referenced below (e.g., images/gallery_model_comparison.png). Ensure the image paths in the .. image:: directives match where you save the plots (likely an images subdirectory relative to this file).

Multi-Metric Model Comparison

The plot_model_comparison() function is a tool for moving beyond single-score evaluations. It creates a polar radar chart to visualize and compare multiple models across several performance metrics simultaneously, providing a holistic “fingerprint” of each model’s strengths and weaknesses.

First, let’s break down the components of this comparative plot.

Plot Anatomy

Angle (θ): Each angular axis represents a different performance metric (e.g., R², MAE, Training Time).
Radius (r): Corresponds to the normalized performance score for that metric, typically scaled to the range [0, 1]. To maintain consistency, all metrics are scaled such that a larger radius is always better (e.g., lower MAE or faster training time results in a larger radius).
Polygon: Each colored polygon represents a model, with its vertices showing its performance on each metric. The overall shape and size of the polygon provide an at-a-glance summary of the model’s performance profile.
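To make the "larger radius is always better" convention concrete, here is a minimal sketch of one way such scaling could be done: min-max normalization per metric, with error-type metrics flipped. The helper `normalize_scores` and the raw scores are hypothetical and only illustrate the idea; this is not necessarily how `scale='norm'` is implemented inside k-diagram.

```python
def normalize_scores(values, higher_is_better=True):
    """Min-max scale one metric across models to [0, 1], larger = better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this metric; give every model the same radius.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    if not higher_is_better:
        # Flip error-type metrics so the best (lowest) value maps to 1.
        scaled = [1.0 - s for s in scaled]
    return scaled


# Hypothetical raw scores for three models (illustrative numbers only).
r2 = [0.90, 0.88, 0.70]
mae = [3.2, 3.6, 6.5]
train_time = [0.1, 0.3, 0.8]

radii = {
    "r2": normalize_scores(r2, higher_is_better=True),
    "mae": normalize_scores(mae, higher_is_better=False),
    "train_time": normalize_scores(train_time, higher_is_better=False),
}

for metric, values in radii.items():
    print(metric, [round(v, 3) for v in values])
```

With this kind of scaling the best model on each axis sits at radius 1 and the worst at radius 0, so the chart shows relative rather than absolute performance.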

With this framework, we can now apply the plot to a real-world model selection problem, progressing from a standard regression task to a more nuanced classification task.

Use Case 1: Standard Regression Model Comparison

The most common use for this plot is to select the best model for a standard regression task by balancing accuracy, error, and efficiency.

Let’s imagine an analytics team at an e-commerce company has built three different models to predict sales revenue: a fast but simple Ridge regression, a Lasso model that performs feature selection, and a more complex Decision Tree. They need to choose the best all-around performer.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Sales Revenue Forecast ---
np.random.seed(42)
n_samples = 100
y_true_reg = np.random.rand(n_samples) * 50 + 10 # True revenue
# Model 1 (Ridge): Good fit, fast
y_pred_r1 = y_true_reg + np.random.normal(0, 4, n_samples)
# Model 2 (Lasso): Similar fit, slightly slower
y_pred_r2 = y_true_reg * 0.98 + 1 + np.random.normal(0, 4.5, n_samples)
# Model 3 (Tree): Overfit, slower, poor on some metrics
y_pred_r3 = y_true_reg + np.random.normal(2, 8, n_samples)

times = [0.1, 0.3, 0.8] # Training times in seconds
names = ['Ridge', 'Lasso', 'Tree']

# --- 2. Plotting ---
# Using default regression metrics: ['r2', 'mae', 'mape', 'rmse']
kd.plot_model_comparison(
    y_true_reg,
    y_pred_r1, y_pred_r2, y_pred_r3,
    train_times=times,
    names=names,
    title="Use Case 1: E-Commerce Sales Model Comparison",
    scale='norm',
    savefig="gallery/images/gallery_model_comparison_regression.png"
)
plt.close()

A radar chart showing the performance profiles of Ridge, Lasso, and Decision Tree models across five different metrics.

Use Case 2: Evaluating a Classification Task with a Custom Metric

This plot is equally useful for classification. The default metrics automatically switch to ['accuracy', 'precision', 'recall', 'f1'], but we can also provide our own custom metrics to evaluate performance on criteria that are specific to our business problem.

Let’s consider a medical diagnosis model that predicts whether a patient has a rare disease. In this case, Recall (correctly identifying sick patients) is far more important than Precision. We can create a custom, weighted F-beta score to reflect this and add it to our plot.

from sklearn.metrics import fbeta_score

# --- 1. Data Generation: Medical Diagnosis ---
np.random.seed(0)
n_samples = 200
y_true_clf = np.array([0] * 180 + [1] * 20) # Imbalanced data
# Model A: High precision, but misses sick patients (low recall)
y_pred_A = np.copy(y_true_clf)
y_pred_A[np.random.choice(np.where(y_true_clf==1)[0], 12, False)] = 0
# Model B: Lower precision (20 false alarms), but finds every sick patient (high recall)
y_pred_B = np.copy(y_true_clf)
y_pred_B[np.random.choice(np.where(y_true_clf==0)[0], 20, False)] = 1

# --- 2. Define a custom metric that prioritizes Recall ---
# An F-beta score with beta=2 weighs recall higher than precision
f2_score = lambda y_true, y_pred: fbeta_score(y_true, y_pred, beta=2)
f2_score.__name__ = "F2-Score (Recall Focus)" # Give it a nice name for the plot

# --- 3. Plotting with default and custom metrics ---
kd.plot_model_comparison(
    y_true_clf,
    y_pred_A,
    y_pred_B,
    names=['Model A (High Precision)', 'Model B (High Recall)'],
    metrics=['accuracy', 'precision', 'recall', f2_score], # Add our custom metric
    title="Use Case 2: Medical Diagnosis Classifier Comparison",
    scale='norm',
    savefig="gallery/images/gallery_model_comparison_classification.png"
)
plt.close()

A radar chart showing how two classifiers perform on standard metrics as well as a custom “F2-Score” that prioritizes recall.
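The effect of beta=2 can be checked by hand from the confusion-matrix counts implied by the simulated data above. The sketch below uses the standard F-beta definition, \(F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}\); the helper `fbeta_from_counts` is a hypothetical name written for this illustration and is not part of scikit-learn or k-diagram.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta from confusion-matrix counts (returns 0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


# Counts implied by the simulated data: 20 sick patients out of 200.
# Model A flips 12 true positives to negative: TP=8, FP=0, FN=12.
f1_a = fbeta_from_counts(tp=8, fp=0, fn=12, beta=1)
f2_a = fbeta_from_counts(tp=8, fp=0, fn=12, beta=2)
# Model B flags 20 healthy patients as sick: TP=20, FP=20, FN=0.
f1_b = fbeta_from_counts(tp=20, fp=20, fn=0, beta=1)
f2_b = fbeta_from_counts(tp=20, fp=20, fn=0, beta=2)

print(f"Model A: F1={f1_a:.3f}, F2={f2_a:.3f}")
print(f"Model B: F1={f1_b:.3f}, F2={f2_b:.3f}")
```

Moving from beta=1 to beta=2 lowers Model A's score and raises Model B's, which is exactly the recall emphasis the custom metric is meant to add.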

Best Practice

Don’t rely solely on default metrics. For real-world problems, business needs often dictate that some errors are more costly than others. Adding custom metrics to the plot_model_comparison function, as shown in this use case, is a powerful way to ensure your model evaluation aligns with your specific goals.

For a deeper understanding of the statistical concepts behind these evaluation metrics, please refer back to the main Multi-Metric Model Comparison (plot_model_comparison()) section.

Model Reliability (Calibration) Diagram

The plot_reliability_diagram() function draws a reliability diagram, the standard tool for assessing the calibration of a binary classifier. It answers a crucial question: “When my model predicts a 70% probability of an event, does that event actually happen 70% of the time?” A model whose probabilities accurately reflect real-world frequencies is considered “well-calibrated”, a property that is essential for making trustworthy, risk-based decisions.

Let’s begin by breaking down the components of this fundamental plot.

Plot Anatomy

X-Axis (Mean Predicted Probability): For each bin, this is the average of the probabilities predicted by the model. This is also referred to as the forecast’s confidence.
Y-Axis (Observed Frequency): For each bin, this is the actual fraction of positive cases observed in the data. This is also referred to as the forecast’s accuracy.
Diagonal Line (\(y=x\)): This is the line of perfect calibration. A model whose points fall on this line is perfectly calibrated.
Counts Panel (Bottom): A histogram showing the number of predictions that fall into each probability bin, which helps in diagnosing if the model is timid (most predictions near 0.5) or decisive (most predictions near 0 or 1).
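The three quantities described above (mean predicted probability, observed frequency, and count per bin) can be computed in a few lines. This is a minimal pure-Python sketch under the assumption of uniform bins over [0, 1]; `reliability_bins` is a hypothetical helper, and k-diagram's internal binning may differ in details such as edge handling.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, observed_frequency, count) per non-empty bin."""
    sums_p = [0.0] * n_bins
    sums_y = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        # Uniform bins over [0, 1]; p == 1.0 goes into the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        sums_p[k] += p
        sums_y[k] += y
        counts[k] += 1
    return [
        (sums_p[k] / counts[k], sums_y[k] / counts[k], counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]


# Tiny illustrative dataset (not from the gallery examples).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.15, 0.22, 0.28, 0.71, 0.78, 0.74, 0.95]

bins = reliability_bins(y_true, y_prob, n_bins=5)
for mean_pred, obs_freq, count in bins:
    print(f"x={mean_pred:.2f}  y={obs_freq:.2f}  n={count}")
```

Each tuple is one point on the calibration curve (x, y) together with the bar height for the counts panel.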

With this in mind, let’s explore how to use this plot to diagnose and compare the reliability of different models.

Use Case 1: Basic Calibration Check with Uniform Bins

The most common use case is to get a quick, initial assessment of a single model’s calibration. For this, we can use the default uniform binning strategy, which creates equally spaced bins across the [0, 1] probability range.

Let’s evaluate a model trained to predict customer churn, where a “1” means the customer is likely to leave.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Customer Churn Predictions ---
np.random.seed(0)
n_customers = 2000
# True outcome: ~30% of customers churn
y_true = (np.random.rand(n_customers) < 0.3).astype(int)
# A reasonably good, but not perfect, model
y_pred = np.clip(y_true * 0.4 + 0.3 + np.random.normal(0, 0.15, n_customers), 0.01, 0.99)

# --- 2. Plotting ---
kd.plot_reliability_diagram(
    y_true, y_pred,
    names=['Churn Model'],
    n_bins=10,
    strategy="uniform", # Default, but explicit here
    title='Use Case 1: Basic Calibration Check',
    savefig="gallery/images/gallery_reliability_diagram_basic.png"
)
plt.close()

A reliability diagram showing the model’s calibration curve relative to the perfect diagonal. The counts panel below shows the distribution of its predictions.

Use Case 2: Comparing Models with Quantile Binning

A more advanced task is to compare the reliability of multiple competing models. For this, quantile binning is often preferable to uniform binning: it places a roughly equal number of samples in each bin, which gives a more stable estimate of the observed frequency when predictions cluster in a narrow probability range.
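The difference between the two strategies is easiest to see on skewed predictions. The sketch below is an assumption-laden illustration: `uniform_bin_counts` and `quantile_bins` are hypothetical helpers, and the quantile version simply splits the sorted predictions by rank, which may not match how k-diagram computes its bin edges (for example when many predictions are tied).

```python
def uniform_bin_counts(probs, n_bins):
    """Count predictions falling into equally spaced bins over [0, 1]."""
    counts = [0] * n_bins
    for p in probs:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def quantile_bins(probs, n_bins):
    """Split the sorted predictions into n_bins nearly equal-sized groups."""
    ordered = sorted(probs)
    n = len(ordered)
    cuts = [k * n // n_bins for k in range(n_bins + 1)]
    return [ordered[cuts[k]:cuts[k + 1]] for k in range(n_bins)]


# Skewed predictions: most of the mass sits near zero (illustrative values).
probs = [0.02, 0.03, 0.05, 0.06, 0.08, 0.09,
         0.11, 0.12, 0.15, 0.18, 0.55, 0.93]

uniform_counts = uniform_bin_counts(probs, n_bins=4)
groups = quantile_bins(probs, n_bins=4)

print("uniform counts: ", uniform_counts)
print("quantile counts:", [len(g) for g in groups])
```

Uniform binning leaves one bin empty and two bins with a single sample each, while the rank-based split keeps three samples in every bin at the cost of unequal bin widths.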

Let’s compare our “Churn Model” to a new “Calibrated Model” that has been post-processed to improve its reliability.

# --- 1. Data Generation (uses y_true and y_pred from previous step) ---
# Create a second, better-calibrated model's predictions
y_pred_calibrated = np.clip(y_true * 0.35 + 0.32 + np.random.normal(0, 0.1, n_customers), 0.01, 0.99)

# --- 2. Plotting ---
kd.plot_reliability_diagram(
    y_true, y_pred, y_pred_calibrated,
    names=['Original Model', 'Calibrated Model'],
    n_bins=12,
    strategy="quantile", # Use quantile binning for a stable comparison
    error_bars="wilson",  # Add Wilson confidence intervals
    title='Use Case 2: Comparing Model Reliability',
    savefig="gallery/images/gallery_reliability_diagram_compare.png"
)
plt.close()

Calibration curves for the “Original Model” (blue) and the “Calibrated Model” (orange), drawn on the same axes with quantile binning and Wilson confidence intervals.

🧠 Interpretation

This side-by-side comparison on the same axes reveals the distinct calibration profiles of the two models. The Original Model (blue) clearly deviates from the diagonal, exhibiting significant under-confidence for predicted probabilities between 0.4 and 0.6. The “Calibrated Model” (orange) shows a different pattern of miscalibration, with a noticeable “S” shape where it is first under-confident and then over-confident.

Interestingly, the quantitative metrics in the legend confirm this visual assessment: the attempted calibration was not successful in this case, as the “Calibrated Model” has a slightly worse (higher) ECE score than the original. This is a perfect example of why reliability diagrams are so crucial—they provide a nuanced diagnostic that goes beyond simple labels and reveals the true behavior of a model’s probability outputs.
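For readers unfamiliar with the ECE score mentioned above, the following sketch shows the commonly used definition: the count-weighted average of the absolute gap between observed frequency and mean predicted probability in each bin. It assumes uniform bins, and `expected_calibration_error` is a hypothetical helper; the exact ECE variant reported in the k-diagram legend may be computed differently.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with uniform bins: count-weighted mean of |observed - predicted|."""
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    count = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        k = min(int(p * n_bins), n_bins - 1)
        sum_p[k] += p
        sum_y[k] += y
        count[k] += 1
    n = len(y_true)
    return sum(
        (count[k] / n) * abs(sum_y[k] / count[k] - sum_p[k] / count[k])
        for k in range(n_bins)
        if count[k] > 0
    )


# Tiny illustrative dataset (not from the gallery examples).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.15, 0.22, 0.28, 0.71, 0.78, 0.74, 0.95]
ece = expected_calibration_error(y_true, y_prob, n_bins=5)

# A toy case that is perfectly calibrated within its single occupied bin.
ece_perfect = expected_calibration_error([0, 1], [0.5, 0.5], n_bins=5)

print(f"ECE = {ece:.4f}, perfectly calibrated toy case = {ece_perfect:.4f}")
```

Lower is better, and a value of 0 means every bin's observed frequency matches its mean prediction.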

Best Practice

When comparing multiple models, using strategy="quantile" is highly recommended. It prevents bins from being empty and provides more stable and reliable estimates of the observed frequencies, leading to a fairer comparison between models. Also, including error bars (e.g., error_bars="wilson") provides crucial context about the statistical uncertainty of your assessment.
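To see why error bars matter, it helps to compute a Wilson score interval for a single bin. The sketch below uses the textbook Wilson formula with z = 1.96 for an approximate 95% interval; `wilson_interval` is a hypothetical helper, and the confidence level used by `error_bars="wilson"` in k-diagram is an assumption here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# The same observed frequency (70%) in a sparse bin and a well-populated bin.
small = wilson_interval(7, 10)
large = wilson_interval(700, 1000)

print(f"n=10:   [{small[0]:.3f}, {small[1]:.3f}]")
print(f"n=1000: [{large[0]:.3f}, {large[1]:.3f}]")
```

The observed frequency is identical in both cases, but the sparse bin's interval is far wider, so an apparent deviation from the diagonal in such a bin may simply be noise.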

Polar Reliability Diagram (Calibration Spiral)

The plot_polar_reliability() function provides a novel and highly intuitive visualization of model calibration. It transforms the traditional reliability diagram into a “calibration spiral,” where deviations from a perfect spiral immediately reveal the nature and location of a model’s miscalibrations through diagnostic coloring.

First, let’s break down the components of this innovative plot.

Plot Anatomy

Angle (θ): Represents the mean predicted probability (\(\bar{p}_k\)) for each bin, sweeping from 0.0 at 0° to 1.0 at 90°. This is the model’s confidence.
Radius (r): Represents the observed frequency of the event (\(\bar{y}_k\)) for each bin. This is the actual outcome.
Perfect Calibration Spiral: The dashed black line represents the ideal case where \(r = \frac{2\theta}{\pi}\) (\(\bar{y}_k = \bar{p}_k\)). A model’s spiral should lie directly on this line.
Color: The color of the model’s spiral is a diagnostic tool, representing the calibration error (\(\bar{y}_k - \bar{p}_k\)). Colors on one side of the colormap’s center (e.g., reds) indicate over-confidence, while colors on the other side (e.g., blues) indicate under-confidence.
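The mapping described above can be written out directly. This is a minimal sketch of the geometry only, assuming the angle is \(\theta = \bar{p}_k \cdot \pi/2\) and the radius is \(\bar{y}_k\) as stated in the anatomy; `to_polar` and `reference_radius` are hypothetical helpers, and the bin values are made up for illustration.

```python
import math


def to_polar(mean_pred, obs_freq):
    """Map one bin to polar coordinates: theta = p * pi/2, r = observed frequency."""
    return mean_pred * math.pi / 2, obs_freq


def reference_radius(theta):
    """Radius of the perfect-calibration spiral at angle theta: r = 2*theta/pi."""
    return 2 * theta / math.pi


# Hypothetical bins as (mean predicted probability, observed frequency).
bins = [(0.1, 0.12), (0.5, 0.45), (0.9, 0.70)]

gaps = []
for mean_pred, obs_freq in bins:
    theta, r = to_polar(mean_pred, obs_freq)
    # The radial gap to the reference spiral equals y_bar - p_bar.
    gap = r - reference_radius(theta)
    gaps.append(gap)
    print(f"theta={math.degrees(theta):5.1f} deg  r={r:.2f}  gap={gap:+.2f}")
```

A negative gap places the point inside the reference spiral (observed frequency below the predicted probability), and a positive gap places it outside.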

With this in mind, let’s apply the plot to a real-world problem to see how it uncovers different types of miscalibration.

Use Case 1: Diagnosing an Over-Confident Model

A common failure mode for classifiers, especially on complex tasks, is overconfidence. The model assigns high probabilities to its predictions, but its real-world accuracy doesn’t match this high level of certainty.

Let’s simulate a scenario in medical diagnostics, where a model is trained to predict the probability of a disease. An overconfident model might predict a 90% probability of disease when the actual rate for such patients is only 70%, which could lead to unnecessary treatments.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: A Well-Calibrated vs. Over-Confident Model ---
np.random.seed(0)
n_samples = 2000
# The true probability of the event is 0.4
y_true = (np.random.rand(n_samples) < 0.4).astype(int)

# A well-calibrated model's probabilities are realistic
calibrated_preds = np.clip(0.4 + np.random.normal(0, 0.15, n_samples), 0, 1)

# An over-confident model pushes probabilities towards the extremes of 0 and 1
overconfident_preds = np.clip(0.4 + np.random.normal(0, 0.3, n_samples), 0, 1)

# --- 2. Plotting ---
kd.plot_polar_reliability(
    y_true,
    calibrated_preds,
    overconfident_preds,
    names=["Well-Calibrated", "Over-Confident"],
    n_bins=15,
    cmap='coolwarm',
    title="Use Case 1: Diagnosing an Over-Confident Model",
    savefig="gallery/images/gallery_polar_reliability_overconfident.png"
)
plt.close()

A polar reliability diagram showing one well-calibrated and one over-confident model. The “Well-Calibrated” model’s spiral closely follows the dashed reference line, while the “Over-Confident” model’s spiral falls inside the reference.

Comparing Metrics Across Horizons

The plot_horizon_metrics() function creates a polar bar chart designed to compare two key metrics across a set of distinct categories, such as different forecast horizons. It’s a powerful tool for visualizing how a model’s uncertainty (bar height) and central tendency (bar color) evolve over time or differ between groups.

First, let’s break down the components of this two-dimensional summary plot.

Plot Anatomy

Angle (θ): Each angular sector represents a distinct category or horizon (e.g., “H+1”, “H+2”), corresponding to a row in the input DataFrame. The labels for these sectors are provided via the xtick_labels parameter.
Radius (r): The height of each bar represents the average value of a primary metric. By default, this is the mean prediction interval width (\(Q_{upper} - Q_{lower}\)).
Color: The color of each bar visualizes a secondary metric. By default, this is the mean of the median (Q50) predictions for that category, adding another layer of information to the comparison.
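The two default quantities behind each bar can be sketched without any plotting. The example below assumes, as described above, that each horizon's bar height is the mean of \(Q_{upper} - Q_{lower}\) over its samples and that its color is the mean of the Q50 samples; `horizon_summary` and the numbers are hypothetical and only illustrate the aggregation.

```python
from statistics import mean

# Hypothetical quantile forecasts: two samples per horizon.
forecasts = {
    "H+1": {"q10": [7.0, 8.0], "q50": [10.0, 10.5], "q90": [12.0, 13.0]},
    "H+2": {"q10": [10.0, 11.0], "q50": [15.0, 15.5], "q90": [18.0, 19.0]},
    "H+3": {"q10": [14.5, 15.0], "q50": [20.0, 21.0], "q90": [25.5, 26.0]},
}


def horizon_summary(row):
    """Return (mean interval width, mean Q50) for one horizon."""
    widths = [up - low for low, up in zip(row["q10"], row["q90"])]
    return mean(widths), mean(row["q50"])


summary = {h: horizon_summary(row) for h, row in forecasts.items()}
for horizon, (width, center) in summary.items():
    print(f"{horizon}: bar height={width:.2f}, color value={center:.2f}")
```

Here both values rise with the horizon, which corresponds to taller bars and a shift along the colormap in the plot.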

With this in mind, let’s apply the plot to a classic forecasting problem.

Use Case 1: Standard Forecast Horizon Analysis

The most common use of this plot is to see how a model’s uncertainty and central prediction change as it forecasts further into the future. It’s a typical and expected behavior for uncertainty to grow over longer lead times, and this plot quantifies that drift.

Let’s simulate a multi-step forecast where both the predicted value and its uncertainty increase for each step.

import kdiagram as kd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Multi-step Forecast ---
# Each row represents a forecast horizon (H+1 to H+6);
# each quantile has two sample columns (suffixes _s1 and _s2)
np.random.seed(0)  # Fixed seed so the figure is reproducible
horizons = ["H+1", "H+2", "H+3", "H+4", "H+5", "H+6"]
df_horizons = pd.DataFrame(index=horizons)

for i, horizon in enumerate(horizons):
    # Both median and width increase with the horizon
    median = 10 + 5 * i
    width = 5 + 3 * i
    # Create two noisy samples of each quantile for this horizon
    for s in (1, 2):
        df_horizons.loc[horizon, f'q10_s{s}'] = median - width/2 + np.random.randn()
        df_horizons.loc[horizon, f'q90_s{s}'] = median + width/2 + np.random.randn()
        df_horizons.loc[horizon, f'q50_s{s}'] = median + np.random.randn()

# --- 2. Plotting ---
kd.plot_horizon_metrics(
    df=df_horizons,
    qlow_cols=['q10_s1', 'q10_s2'],
    qup_cols=['q90_s1', 'q90_s2'],
    q50_cols=['q50_s1', 'q50_s2'],
    title="Use Case 1: Mean Interval Width Across Horizons",
    xtick_labels=horizons,
    r_label="Mean Interval Width",
    cbar_label="Mean Q50 Value",
    savefig="gallery/images/gallery_horizon_metrics_basic.png"
)
plt.close()

A polar bar chart where both the height of the bars (uncertainty) and their color (median prediction) increase progressively across the forecast horizons.

Combined Analysis: Reliability and Horizon Drift

Evaluating a sophisticated forecasting model often requires more than a single plot. A comprehensive analysis uses multiple, complementary visualizations to diagnose different aspects of performance. This section walks through a workflow that combines plot_polar_reliability() and plot_horizon_metrics() to perform a two-part evaluation of a weather forecast.

First, let’s re-introduce the anatomy of the two plots we will be using in our combined analysis.

Plot Anatomy (Polar Reliability)

Angle (θ): Represents the mean predicted probability of an event (e.g., rain), sweeping from 0.0 to 1.0.
Radius (r): Represents the observed frequency of that event.
Reference: The dashed black spiral is the line of perfect calibration. A good model’s curve should follow this spiral.

Plot Anatomy (Horizon Metrics)

Angle (θ): Represents distinct forecast horizons (e.g., “H+6”, “H+12”).
Radius (r): The height of each bar represents the average prediction interval width (uncertainty).
Color: The color of each bar represents the average median (Q50) prediction (e.g., the expected amount of rain).

Now, let’s apply these two diagnostics to a challenging, real-world forecasting problem.

Use Case: A Holistic Evaluation of a Weather Forecast Model

A meteorological agency has a new weather model that produces two key outputs for a 24-hour period:

The probability that it will rain at all (a binary event).
A probabilistic forecast of the total rainfall amount (in mm).

To validate this new model, we need to answer two critical questions:

Is the model reliable? When it predicts a 70% chance of rain, is it trustworthy?
How does its uncertainty grow over time? Is the forecast for rainfall amount sharp and useful for the next 6 hours, but too uncertain for the full 24-hour period?

We will perform a combined analysis by creating a side-by-side plot to answer both questions at once.

Practical Example

import kdiagram as kd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation ---
np.random.seed(1)
n_days = 1000

# --- Part A: Data for Reliability Plot (Probability of Rain) ---
# True events: It rains on 40% of days
y_true_rain_event = (np.random.rand(n_days) < 0.4).astype(int)
# A slightly over-confident model for predicting the event
y_pred_rain_prob = np.clip(0.4 + np.random.normal(0, 0.3, n_days), 0, 1)

# --- Part B: Data for Horizon Metrics Plot (Amount of Rain) ---
horizons = ["H+6", "H+12", "H+18", "H+24"]
df_horizons = pd.DataFrame(index=horizons)
# For each horizon, we have two samples (e.g., from different days)

for i in range(len(horizons)):
    # Both median rainfall and uncertainty increase with the horizon
    median = 5 + 5 * i
    width = 3 + 4 * i
    # Create two samples for each horizon
    df_horizons.loc[f"H+{6*(i+1)}", 'q10_s1'] = median - width/2 + np.random.randn()
    df_horizons.loc[f"H+{6*(i+1)}", 'q90_s1'] = median + width/2 + np.random.randn()
    df_horizons.loc[f"H+{6*(i+1)}", 'q50_s1'] = median + np.random.randn()
    df_horizons.loc[f"H+{6*(i+1)}", 'q10_s2'] = median - width/2 + np.random.randn()
    df_horizons.loc[f"H+{6*(i+1)}", 'q90_s2'] = median + width/2 + np.random.randn()
    df_horizons.loc[f"H+{6*(i+1)}", 'q50_s2'] = median + np.random.randn()

# --- 2. Create a figure with two polar subplots ---
fig = plt.figure(figsize=(18, 9))
ax1 = fig.add_subplot(1, 2, 1, projection='polar')
ax2 = fig.add_subplot(1, 2, 2, projection='polar')

# --- 3. Plot each diagnostic on its dedicated axis ---
kd.plot_polar_reliability(
    y_true_rain_event, y_pred_rain_prob,
    ax=ax1,
    names=["Forecast Model"],
    title='Part A: Is the Rain Probability Forecast Reliable?'
)
kd.plot_horizon_metrics(
    df=df_horizons,
    ax=ax2,
    qlow_cols=['q10_s1', 'q10_s2'],
    qup_cols=['q90_s1', 'q90_s2'],
    q50_cols=['q50_s1', 'q50_s2'],
    xtick_labels=horizons,
    title='Part B: How Does Rainfall Uncertainty Evolve?',
    r_label="Mean Interval Width (mm)",
    cbar_label="Mean Predicted Rainfall (mm)"
)

fig.suptitle('Combined Analysis of a Weather Forecast Model', fontsize=18)
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
fig.savefig("gallery/images/gallery_comparison_combined.png")
plt.close(fig)

A two-panel figure providing a complete model evaluation. The left plot diagnoses the calibration of the rain probability forecast, while the right plot shows how the uncertainty of the rainfall amount forecast grows over time.

🧠 Analysis and Interpretation

This combined view provides a comprehensive performance summary that would be impossible to get from a single plot.

The Reliability Spiral on the left diagnoses the model’s ability to predict if it will rain. The model’s curve falls slightly inside the dashed reference spiral, particularly for higher probabilities. This indicates the model is slightly over-confident: when it predicts a high probability of rain, the actual frequency is a bit lower.

The Horizon Metrics plot on the right shows a clear drift in the forecast for rainfall amount. The height of the bars (mean interval width) increases steadily from the 6-hour to the 24-hour forecast, indicating that the model’s uncertainty grows significantly over longer lead times. The color also shifts from blue to red, showing that the median predicted rainfall amount also increases.

Overall Conclusion: By combining these two plots, we can conclude that while the model is slightly over-confident in predicting if it will rain, its primary weakness is a rapid degradation in the precision of its forecast for how much it will rain at longer lead times. This is a critical insight for anyone using this model for operational planning.

For a deeper understanding of the statistical concepts behind these evaluation techniques, please refer back to the main Model Comparison Visualization and Evaluating Probabilistic Forecasts sections.