k-diagram v1.4.3

Model Comparison Gallery

This gallery page showcases plots from k-diagram designed for comparing the performance of multiple models across various metrics, primarily using radar charts.

Note

You need to run the code snippets locally to generate the plot images referenced below (e.g., images/gallery_model_comparison.png). Ensure the image paths in the .. image:: directives match where you save the plots (likely an images subdirectory relative to this file).

Multi-Metric Model Comparison

The plot_model_comparison() function is a tool for moving beyond single-score evaluations. It creates a polar radar chart to visualize and compare multiple models across several performance metrics simultaneously, providing a holistic “fingerprint” of each model’s strengths and weaknesses.

First, let’s break down the components of this comparative plot.

Plot Anatomy

  • Angle (θ): Each angular axis represents a different performance metric (e.g., R², MAE, Training Time).

  • Radius (r): Corresponds to the normalized performance score for that metric, typically scaled to the range [0, 1]. To maintain consistency, all metrics are scaled such that a larger radius is always better (e.g., lower MAE or faster training time results in a larger radius).

  • Polygon: Each colored polygon represents a model, with its vertices showing its performance on each metric. The overall shape and size of the polygon provide an at-a-glance summary of the model’s performance profile.
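The "larger radius is always better" convention can be sketched with a small min-max helper. This is an illustration only: `minmax_scores` is a hypothetical helper, the raw numbers are invented, and whether `plot_model_comparison()` normalizes in exactly this way is an assumption.

```python
def minmax_scores(values, higher_is_better=True):
    """Scale one metric across models to [0, 1] so that 1 is always best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this metric; giving everyone 1.0 is an
        # arbitrary choice made for this sketch.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    if not higher_is_better:
        # Error and time metrics are flipped so that smaller raw values
        # end up with a larger radius.
        scaled = [1.0 - s for s in scaled]
    return scaled


# Hypothetical raw scores for three models (illustrative numbers only).
r2 = [0.92, 0.90, 0.70]
mae = [3.1, 3.6, 6.4]
train_time = [0.1, 0.3, 0.8]

r2_radius = minmax_scores(r2)
mae_radius = minmax_scores(mae, higher_is_better=False)
time_radius = minmax_scores(train_time, higher_is_better=False)

for name, radii in [("r2", r2_radius), ("mae", mae_radius), ("time", time_radius)]:
    print(name, [round(r, 3) for r in radii])
```

Under this kind of scaling the weakest model on an axis always lands at radius 0 and the strongest at radius 1, so the chart shows relative standing rather than raw values.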

With this framework, we can now apply the plot to a real-world model selection problem, progressing from a standard regression task to a more nuanced classification task.


Use Case 1: Standard Regression Model Comparison

The most common use for this plot is to select the best model for a standard regression task by balancing accuracy, error, and efficiency.

Let’s imagine an analytics team at an e-commerce company has built three different models to predict sales revenue: a fast but simple Ridge regression, a Lasso model that performs feature selection, and a more complex Decision Tree. They need to choose the best all-around performer.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Sales Revenue Forecast ---
np.random.seed(42)
n_samples = 100
y_true_reg = np.random.rand(n_samples) * 50 + 10  # True revenue
# Model 1 (Ridge): Good fit, fast
y_pred_r1 = y_true_reg + np.random.normal(0, 4, n_samples)
# Model 2 (Lasso): Similar fit, slightly slower
y_pred_r2 = y_true_reg * 0.98 + 1 + np.random.normal(0, 4.5, n_samples)
# Model 3 (Tree): Overfit, slower, poor on some metrics
y_pred_r3 = y_true_reg + np.random.normal(2, 8, n_samples)

times = [0.1, 0.3, 0.8]  # Training times in seconds
names = ['Ridge', 'Lasso', 'Tree']

# --- 2. Plotting ---
# Using default regression metrics: ['r2', 'mae', 'mape', 'rmse']
kd.plot_model_comparison(
    y_true_reg,
    y_pred_r1, y_pred_r2, y_pred_r3,
    train_times=times,
    names=names,
    title="Use Case 1: E-Commerce Sales Model Comparison",
    scale='norm',
    savefig="gallery/images/gallery_model_comparison_regression.png"
)
plt.close()

A radar chart showing the performance profiles of Ridge, Lasso, and Decision Tree models across five different metrics.

🧠 Analysis and Interpretation

This plot summarizes each model's relative standing. The Ridge model (blue) is the clear winner on all predictive performance metrics (r2, mae, mape, and rmse): its polygon has the largest overall area and extends furthest on these axes. The Lasso model (orange) is slightly less accurate. On the Train Time (s) axis, the times passed in the code (0.1 s for Ridge, 0.3 s for Lasso, 0.8 s for Tree) also favor Ridge, because a shorter training time maps to a larger radius. The Tree model is not visible, which is what happens when a model has the lowest normalized score on every axis and its polygon collapses to the center. In this synthetic example the choice is straightforward: Ridge is both the most accurate and the fastest to train, with Lasso a close second.


Use Case 2: Evaluating a Classification Task with a Custom Metric

This plot is equally useful for classification. The default metrics automatically switch to ['accuracy', 'precision', 'recall', 'f1'], but we can also provide our own custom metrics to evaluate performance on criteria that are specific to our business problem.

Let’s consider a medical diagnosis model that predicts whether a patient has a rare disease. In this case, Recall (correctly identifying sick patients) is far more important than Precision. We can create a custom, weighted F-beta score to reflect this and add it to our plot.
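For reference, the standard F-beta score combines precision \(P\) and recall \(R\) as follows:

```latex
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
```

With \(\beta = 2\) this becomes \(F_2 = 5PR / (4P + R)\), so a drop in recall lowers the score more than an equal drop in precision.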

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import fbeta_score

# --- 1. Data Generation: Medical Diagnosis ---
np.random.seed(0)
n_samples = 200
y_true_clf = np.array([0] * 180 + [1] * 20)  # Imbalanced data
# Model A: High precision, but misses sick patients (low recall)
y_pred_A = np.copy(y_true_clf)
y_pred_A[np.random.choice(np.where(y_true_clf == 1)[0], 12, False)] = 0
# Model B: Lower precision, but finds most sick patients (high recall)
y_pred_B = np.copy(y_true_clf)
y_pred_B[np.random.choice(np.where(y_true_clf == 0)[0], 20, False)] = 1

# --- 2. Define a custom metric that prioritizes Recall ---
# An F-beta score with beta=2 weighs recall higher than precision
def f2_score(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

f2_score.__name__ = "F2-Score (Recall Focus)"  # Give it a nice name for the plot

# --- 3. Plotting with default and custom metrics ---
kd.plot_model_comparison(
    y_true_clf,
    y_pred_A,
    y_pred_B,
    names=['Model A (High Precision)', 'Model B (High Recall)'],
    metrics=['accuracy', 'precision', 'recall', f2_score],  # Add our custom metric
    title="Use Case 2: Medical Diagnosis Classifier Comparison",
    scale='norm',
    savefig="gallery/images/gallery_model_comparison_classification.png"
)
plt.close()

A radar chart showing how two classifiers perform on standard metrics as well as a custom “F2-Score” that prioritizes recall.

🧠 Interpretation

The radar chart shows a stark trade-off between the two classifiers. Model A (blue) sits at the outer edge on the accuracy and precision axes but at the center on recall and our custom F2-Score. Model B (orange) shows the opposite profile: it reaches the outer edge on recall and the F2-Score and sits at the center on accuracy and precision. These are relative positions, not raw scores. With scale='norm' and only two models, the better model on each axis is mapped to 1 and the other to 0, however small the gap. Model B's raw accuracy, for example, is 0.90 against 0.94 for Model A.

For a medical diagnosis where failing to identify a sick patient (low recall) is a critical error, Model B is the only viable choice. The custom F2-Score (Recall Focus) axis correctly identifies it as the superior model for this specific, high-stakes application.
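To see the raw values behind the normalized axes, the metrics can be recomputed from the confusion counts. The counts below follow directly from the data-generation code (exactly 12 of the 20 positives are flipped for Model A, and exactly 20 of the 180 negatives are flipped for Model B); `binary_metrics` is a hypothetical helper written for this sketch, not part of k-diagram.

```python
def binary_metrics(tp, fp, fn, tn, beta=2.0):
    """Accuracy, precision, recall and F-beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fbeta": fbeta}


# Model A: 8 of 20 sick patients found, no false alarms.
model_a = binary_metrics(tp=8, fp=0, fn=12, tn=180)
# Model B: all 20 sick patients found, 20 healthy patients flagged.
model_b = binary_metrics(tp=20, fp=20, fn=0, tn=160)

for name, scores in [("Model A", model_a), ("Model B", model_b)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Model A has the higher raw accuracy (0.94 against 0.90) and precision (1.0 against 0.5), while Model B has the higher recall (1.0 against 0.4) and F2 score (about 0.83 against 0.45).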

Best Practice

Don’t rely solely on default metrics. For real-world problems, business needs often dictate that some errors are more costly than others. Adding custom metrics to the plot_model_comparison function, as shown in this use case, is a powerful way to ensure your model evaluation aligns with your specific goals.


For a deeper understanding of the statistical concepts behind these evaluation metrics, please refer back to the main Multi-Metric Model Comparison (plot_model_comparison()) section.

Model Reliability (Calibration) Diagram

The plot_reliability_diagram() function draws a reliability diagram, the standard tool for assessing the calibration of a binary classifier. It answers a crucial question: “When my model predicts a 70% probability of an event, does that event actually happen 70% of the time?” A model whose probabilities accurately reflect real-world frequencies is considered “well-calibrated”, and good calibration is essential for making trustworthy, risk-based decisions.

Let’s begin by breaking down the components of this fundamental plot.

Plot Anatomy

  • X-Axis (Mean Predicted Probability): For each bin, this is the average of the probabilities predicted by the model. This is also referred to as the forecast’s confidence.

  • Y-Axis (Observed Frequency): For each bin, this is the actual fraction of positive cases observed in the data. This is also referred to as the forecast’s accuracy.

  • Diagonal Line (\(y=x\)): This is the line of perfect calibration. A model whose points fall on this line is perfectly calibrated.

  • Counts Panel (Bottom): A histogram showing the number of predictions that fall into each probability bin, which helps in diagnosing if the model is timid (most predictions near 0.5) or decisive (most predictions near 0 or 1).
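The quantities listed above can be sketched in a few lines of plain Python. This is a minimal illustration of equal-width binning, assuming probabilities in [0, 1]; `reliability_points` is a hypothetical helper and is not claimed to match the internals of plot_reliability_diagram().

```python
def reliability_points(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, observed_frequency, count) per non-empty bin."""
    prob_sums = [0.0] * n_bins
    positive_counts = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; p == 1.0 is placed in the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        prob_sums[k] += p
        positive_counts[k] += y
        counts[k] += 1
    return [
        (prob_sums[k] / counts[k], positive_counts[k] / counts[k], counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]


# Tiny illustrative dataset with two bins: [0, 0.5) and [0.5, 1].
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]
points = reliability_points(y_true, y_prob, n_bins=2)
for mean_pred, obs_freq, count in points:
    print(f"x={mean_pred:.2f}  y={obs_freq:.2f}  n={count}")
```

Each tuple gives one point of the calibration curve (x, y) plus the bar height for the counts panel.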

With this in mind, let’s explore how to use this plot to diagnose and compare the reliability of different models.


Use Case 1: Basic Calibration Check with Uniform Bins

The most common use case is to get a quick, initial assessment of a single model’s calibration. For this, we can use the default uniform binning strategy, which creates equally spaced bins across the [0, 1] probability range.

Let’s evaluate a model trained to predict customer churn, where a “1” means the customer is likely to leave.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Customer Churn Predictions ---
np.random.seed(0)
n_customers = 2000
# True outcome: ~30% of customers churn
y_true = (np.random.rand(n_customers) < 0.3).astype(int)
# A reasonably good, but not perfect, model
y_pred = np.clip(
    y_true * 0.4 + 0.3 + np.random.normal(0, 0.15, n_customers),
    0.01, 0.99,
)

# --- 2. Plotting ---
kd.plot_reliability_diagram(
    y_true, y_pred,
    names=['Churn Model'],
    n_bins=10,
    strategy="uniform",  # Default, but explicit here
    title='Use Case 1: Basic Calibration Check',
    savefig="gallery/images/gallery_reliability_diagram_basic.png"
)
plt.close()

A reliability diagram showing the model’s calibration curve relative to the perfect diagonal. The counts panel below shows the distribution of its predictions.

🧠 Analysis and Interpretation

This plot provides a first diagnosis of the model's calibration. Compare the calibration curve (blue line) with the dashed diagonal reference line: points below the diagonal mean the model over-predicts churn in that bin (over-confidence), and points above it mean the model under-predicts. Because the synthetic scores are centered on 0.3 for customers who stay and on 0.7 for customers who churn, expect the curve to sit below the diagonal at low predicted probabilities and above it at high ones: a customer scored near 0.7 is almost always a churner. The counts panel at the bottom shows that the predictions fall into two clusters, around 0.3 and 0.7, with few predictions near the extremes.


Use Case 2: Comparing Models with Quantile Binning

A more advanced task is to compare the reliability of multiple competing models. For this, quantile binning is often superior to uniform binning, as it ensures that each bin contains an equal number of samples, providing a more stable estimate of the observed frequency.
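The difference between the two binning strategies can be sketched as follows. Both helpers are hypothetical and written for illustration; library implementations typically derive quantile bin edges instead of slicing the sorted values, and tied predictions can make their bins unequal.

```python
def uniform_bin_counts(probs, n_bins):
    """Count predictions in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in probs:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def quantile_bins(probs, n_bins):
    """Split the sorted predictions into n_bins groups of near-equal size."""
    ordered = sorted(probs)
    n = len(ordered)
    return [ordered[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]


# Skewed predictions: most scores are low, as in many churn or fraud models.
probs = [0.02, 0.05, 0.07, 0.08, 0.11, 0.12, 0.15, 0.18, 0.65, 0.95]

uniform_counts = uniform_bin_counts(probs, n_bins=5)
quantile_counts = [len(group) for group in quantile_bins(probs, n_bins=5)]

print("uniform :", uniform_counts)
print("quantile:", quantile_counts)
```

With uniform bins, two of the five bins are empty and two hold a single prediction, so their observed frequencies are very noisy. With quantile bins, every bin holds two predictions.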

Let’s compare our “Churn Model” to a new “Calibrated Model” that has been post-processed to improve its reliability.

# --- 1. Data Generation (uses y_true and y_pred from previous step) ---
# Create a second model's predictions, intended to be better calibrated
y_pred_calibrated = np.clip(
    y_true * 0.35 + 0.32 + np.random.normal(0, 0.1, n_customers),
    0.01, 0.99,
)

# --- 2. Plotting ---
kd.plot_reliability_diagram(
    y_true, y_pred, y_pred_calibrated,
    names=['Original Model', 'Calibrated Model'],
    n_bins=12,
    strategy="quantile",  # Use quantile binning for a stable comparison
    error_bars="wilson",  # Add Wilson confidence intervals
    title='Use Case 2: Comparing Model Reliability',
    savefig="gallery/images/gallery_reliability_diagram_compare.png"
)
plt.close()

Two calibration curves, for the “Original Model” (blue) and the “Calibrated Model” (orange), are shown on the same axes with Wilson error bars. The legend reports quantitative calibration metrics for each model.

🧠 Interpretation

This side-by-side comparison on the same axes reveals the distinct calibration profiles of the two models. The Original Model (blue) clearly deviates from the diagonal, exhibiting significant under-confidence for predicted probabilities between 0.4 and 0.6. The “Calibrated Model” (orange) shows a different pattern of miscalibration, with a noticeable “S” shape where it is first under-confident and then over-confident.

Interestingly, the quantitative metrics in the legend confirm this visual assessment: the attempted calibration was not successful in this case, as the “Calibrated Model” has a slightly worse (higher) ECE score than the original. This is a perfect example of why reliability diagrams are so crucial—they provide a nuanced diagnostic that goes beyond simple labels and reveals the true behavior of a model’s probability outputs.
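The ECE mentioned above is commonly defined as the count-weighted average gap between observed frequency and mean predicted probability across bins. The sketch below uses that standard definition with invented bin summaries; whether the legend value in k-diagram is computed in exactly this way is an assumption.

```python
def expected_calibration_error(bins):
    """ECE from (mean_predicted, observed_frequency, count) bin summaries."""
    total = sum(count for _, _, count in bins)
    return sum(
        count / total * abs(obs_freq - mean_pred)
        for mean_pred, obs_freq, count in bins
    )


# Illustrative bin summaries (not taken from the figure).
miscalibrated = [(0.30, 0.02, 1400), (0.70, 0.95, 600)]
perfect = [(0.30, 0.30, 1400), (0.70, 0.70, 600)]

ece_bad = expected_calibration_error(miscalibrated)
ece_good = expected_calibration_error(perfect)
print(f"miscalibrated ECE = {ece_bad:.3f}, perfect ECE = {ece_good:.3f}")
```

A lower ECE is better, and a perfectly calibrated model scores 0.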

Best Practice

When comparing multiple models, using strategy="quantile" is highly recommended. It prevents bins from being empty and provides more stable and reliable estimates of the observed frequencies, leading to a fairer comparison between models. Also, including error bars (e.g., error_bars="wilson") provides crucial context about the statistical uncertainty of your assessment.
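For readers who want to see what a Wilson error bar represents, here is a minimal sketch of the Wilson score interval for one bin's observed frequency. The function name and the default z = 1.96 (roughly a 95% interval) are choices made for this example, not k-diagram parameters.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clip to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# A bin with 5 positives out of 10 samples versus 500 out of 1000.
small_bin = wilson_interval(5, 10)
large_bin = wilson_interval(500, 1000)
print("n=10:  ", tuple(round(v, 4) for v in small_bin))
print("n=1000:", tuple(round(v, 4) for v in large_bin))
```

The interval narrows as the bin count grows, which is why bins with few samples deserve less trust when reading the curve.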

See Also

For an alternative, and often more intuitive, way to visualize model calibration, see the plot_polar_reliability() function. It transforms this Cartesian plot into a polar spiral, which can make miscalibration patterns even easier to spot.


Use Case 3: Weighted Calibration for High-Value Segments

In many real-world business problems, not all prediction errors are created equal. An error on a high-value customer can be far more costly than an error on a standard customer. A model might appear well-calibrated overall, but this aggregate view can hide poor performance on the most critical segments. The sample_weight parameter is a powerful tool for diagnosing this exact problem.

Best Practice

When the business impact of your model’s predictions is not uniform across all samples, always perform a weighted calibration analysis. Use the sample_weight parameter to assign higher importance to high-value customers, critical events, or costly failure modes to ensure your model is reliable where it matters most.
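The effect of weighting on a single probability bin can be sketched as below. This assumes the weighted observed frequency is a weight-normalized mean of the outcomes, which is the usual definition; the exact computation inside plot_reliability_diagram() is not confirmed here, and the helper name is hypothetical.

```python
def weighted_observed_frequency(y_true, weights):
    """Weight-normalized share of positive outcomes within one bin."""
    total_weight = sum(weights)
    return sum(w * y for y, w in zip(y_true, weights)) / total_weight


# One bin: nine standard customers (one churned) and one premium customer
# who churned. The premium customer is the last entry.
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
equal_weights = [1] * 10
premium_weights = [1] * 9 + [10]

unweighted = weighted_observed_frequency(outcomes, equal_weights)
weighted = weighted_observed_frequency(outcomes, premium_weights)
print(f"unweighted = {unweighted:.3f}, weighted = {weighted:.3f}")
```

The same bin moves from an observed frequency of 0.20 to about 0.58 once the premium customer counts ten times, which is how a weighted curve can expose miscalibration that the unweighted curve averages away.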

Let’s tackle a common problem in customer retention: ensuring our churn model is reliable for our most valuable “premium” subscribers.

Practical Example

A streaming service uses a model to predict the probability that a subscriber will churn (cancel their subscription). The model’s overall calibration appears to be good. However, the business is most concerned about retaining its “premium” subscribers, as they account for a disproportionate amount of revenue. Is the model’s churn probability trustworthy specifically for this high-value segment?

We will create a side-by-side comparison. The left plot will show the standard, unweighted reliability, while the right plot will use sample_weight to give 10x more importance to the premium subscribers, revealing the model’s true performance for this critical group.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Churn with a high-value segment ---
np.random.seed(10)
n_customers = 5000
# True churn status
y_true = (np.random.rand(n_customers) < 0.2).astype(int)
# Create sample weights: 10% are "premium" customers with 10x weight
sample_weight = np.ones(n_customers)
premium_indices = np.random.choice(n_customers, 500, replace=False)
sample_weight[premium_indices] = 10

# --- 2. Create biased predictions FOR THE PREMIUM SEGMENT ---
# Standard users keep the baseline predictions; for premium users the
# predicted churn probability is halved, so their risk is underestimated
y_pred = np.clip(
    y_true * 0.5 + 0.15 + np.random.normal(0, 0.1, n_customers),
    0.01, 0.99,
)
# Introduce the bias for the premium segment
y_pred[premium_indices] *= 0.5

# --- 3. Create side-by-side plots ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))

kd.plot_reliability_diagram(
    y_true, y_pred,
    ax=ax1,
    names=['Churn Model'],
    title='Use Case 3a: Unweighted Reliability (All Customers)',
    savefig=None  # Prevent saving from the first call
)
kd.plot_reliability_diagram(
    y_true, y_pred,
    ax=ax2,
    sample_weight=sample_weight,  # Apply the crucial sample weights
    names=['Churn Model'],
    title='Use Case 3b: Weighted Reliability (Premium Focus)',
    savefig=None  # Prevent saving from the second call
)

fig.suptitle('Diagnosing Hidden Bias with Weighted Calibration', fontsize=16)
fig.tight_layout(rect=[0, 0, 1, 0.95])
fig.savefig("gallery/images/gallery_reliability_diagram_weighted.png")
plt.close(fig)

A two-panel figure. The left plot (unweighted) shows the model's calibration across all customers. The right plot (weighted by customer value) reveals that the same model systematically underestimates churn risk for its most important customers.

🧠 Interpretation

This side-by-side comparison reveals a critical, hidden flaw that would be missed by a standard analysis. The Unweighted Reliability plot (left) suggests the model is acceptable. Because the premium subscribers are only 10% of the data, their poor calibration is masked by the good performance on the majority of standard users.

However, the Weighted Reliability plot (right) tells a different and more important story. Giving more weight to the premium segment shifts the curve away from the diagonal: in the bins where premium subscribers concentrate, the observed churn frequency is well above the predicted probability. For high-value customers, the model consistently underestimates churn risk, which could lead the business to neglect retention efforts for its most important user base. This analysis demonstrates that the model is not yet fit for its intended business purpose.


For a deeper understanding of the statistical theory behind calibration and proper scoring rules, please refer back to the main Reliability Diagram (plot_reliability_diagram()) section.


Polar Reliability Diagram (Calibration Spiral)

The plot_polar_reliability() function provides a novel and highly intuitive visualization of model calibration. It transforms the traditional reliability diagram into a “calibration spiral,” where deviations from a perfect spiral immediately reveal the nature and location of a model’s miscalibrations through diagnostic coloring.

First, let’s break down the components of this innovative plot.

Plot Anatomy

  • Angle (θ): Represents the mean predicted probability (\(\bar{p}_k\)) for each bin, sweeping from 0.0 at 0° to 1.0 at 90°. This is the model’s confidence.

  • Radius (r): Represents the observed frequency of the event (\(\bar{y}_k\)) for each bin. This is the actual outcome.

  • Perfect Calibration Spiral: The dashed black line represents the ideal case where \(r = \frac{2\theta}{\pi}\) (\(\bar{y}_k = \bar{p}_k\)). A model’s spiral should lie directly on this line.

  • Color: The color of the model’s spiral is a diagnostic tool, representing the calibration error (\(\bar{y}_k - \bar{p}_k\)). Colors on one side of the colormap’s center (e.g., reds) indicate over-confidence, while colors on the other side (e.g., blues) indicate under-confidence.
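The mapping described in the list above can be sketched directly. The helper names are hypothetical; the angle mapping, the reference spiral, and the sign convention of the error follow the definitions given in this section.

```python
import math


def to_polar(mean_pred, obs_freq):
    """Map a bin's (mean predicted probability, observed frequency) to (theta, r)."""
    theta = mean_pred * math.pi / 2  # 0.0 -> 0 degrees, 1.0 -> 90 degrees
    return theta, obs_freq


def perfect_radius(theta):
    """Radius of the perfect-calibration spiral at angle theta."""
    return 2 * theta / math.pi


def diagnose(mean_pred, obs_freq, tol=1e-9):
    """Classify a bin by the sign of the calibration error (observed - predicted)."""
    error = obs_freq - mean_pred
    if error < -tol:
        return "over-confident"   # events occur less often than predicted
    if error > tol:
        return "under-confident"  # events occur more often than predicted
    return "calibrated"


# Illustrative bins: (mean predicted probability, observed frequency).
bins = [(0.2, 0.25), (0.5, 0.5), (0.9, 0.7)]
labels = [diagnose(p, y) for p, y in bins]
for (p, y), label in zip(bins, labels):
    theta, r = to_polar(p, y)
    print(f"theta={math.degrees(theta):5.1f} deg  r={r:.2f}  "
          f"ideal r={perfect_radius(theta):.2f}  {label}")
```

A point inside the reference spiral (r smaller than the ideal radius) is an over-confident bin, and a point outside it is an under-confident bin.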

With this in mind, let’s apply the plot to a real-world problem to see how it uncovers different types of miscalibration.


Use Case 1: Diagnosing an Over-Confident Model

A common failure mode for classifiers, especially on complex tasks, is overconfidence. The model assigns high probabilities to its predictions, but its real-world accuracy doesn’t match this high level of certainty.

Let’s simulate a scenario in medical diagnostics, where a model is trained to predict the probability of a disease. An overconfident model might predict a 90% probability of disease when the actual rate for such patients is only 70%, which could lead to unnecessary treatments.

import kdiagram as kd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: A Well-Calibrated vs. Over-Confident Model ---
np.random.seed(0)
n_samples = 2000
# The true probability of the event is 0.4
y_true = (np.random.rand(n_samples) < 0.4).astype(int)

# A well-calibrated model's probabilities are realistic
calibrated_preds = np.clip(0.4 + np.random.normal(0, 0.15, n_samples), 0, 1)

# An over-confident model pushes probabilities towards the extremes of 0 and 1
overconfident_preds = np.clip(0.4 + np.random.normal(0, 0.3, n_samples), 0, 1)

# --- 2. Plotting ---
kd.plot_polar_reliability(
    y_true,
    calibrated_preds,
    overconfident_preds,
    names=["Well-Calibrated", "Over-Confident"],
    n_bins=15,
    cmap='coolwarm',
    title="Use Case 1: Diagnosing an Over-Confident Model",
    savefig="gallery/images/gallery_polar_reliability_overconfident.png"
)
plt.close()

The “Well-Calibrated” model’s spiral closely follows the dashed reference line, while the “Over-Confident” model’s spiral falls inside the reference.

🧠 Analysis and Interpretation

This plot makes the models’ behaviors easy to distinguish. The “Well-Calibrated” model’s spiral adheres very closely to the dashed “Perfect Calibration” spiral. It has no separate legend entry and is identified by its line segments, which are colored near the neutral center of the colormap. This is the signature of a reliable model.

In stark contrast, the “Over-Confident” model’s spiral deviates significantly. In the region of higher predicted probabilities (larger angles), its spiral falls inside the dashed reference line, and the diagnostic coloring turns red. This is a clear visual indication of over-confidence: the observed frequency (radius) is systematically lower than the predicted probability (angle).

See Also

This plot is the polar-coordinate counterpart to the traditional Cartesian plot_reliability_diagram(). While both show the same underlying data, the spiral format can often make deviations and the nature of miscalibration more intuitive to see at a glance.


For a deeper understanding of the statistical theory behind calibration and reliability, please refer back to the main Polar Reliability Diagram (plot_polar_reliability()) section.

Comparing Metrics Across Horizons

The plot_horizon_metrics() function creates a polar bar chart designed to compare two key metrics across a set of distinct categories, such as different forecast horizons. It’s a powerful tool for visualizing how a model’s uncertainty (bar height) and central tendency (bar color) evolve over time or differ between groups.

First, let’s break down the components of this two-dimensional summary plot.

Plot Anatomy

  • Angle (θ): Each angular sector represents a distinct category or horizon (e.g., “H+1”, “H+2”), corresponding to a row in the input DataFrame. The labels for these sectors are provided via the xtick_labels parameter.

  • Radius (r): The height of each bar represents the average value of a primary metric. By default, this is the mean prediction interval width (\(Q_{upper} - Q_{lower}\)).

  • Color: The color of each bar visualizes a secondary metric. By default, this is the mean of the median (Q50) predictions for that category, adding another layer of information to the comparison.
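The two default quantities can be sketched per row as follows. This is an illustration with invented numbers; `horizon_summary` is a hypothetical helper, and averaging across the sample columns of each row is assumed from the description above, not taken from the library source.

```python
def horizon_summary(rows, qlow_cols, qup_cols, q50_cols):
    """Return (mean interval width, mean Q50) for each row (horizon)."""
    summary = []
    for row in rows:
        widths = [row[up] - row[low] for low, up in zip(qlow_cols, qup_cols)]
        medians = [row[col] for col in q50_cols]
        summary.append((sum(widths) / len(widths), sum(medians) / len(medians)))
    return summary


# Two horizons, two samples each (illustrative values).
rows = [
    {"q10_s1": 8.0, "q90_s1": 12.0, "q50_s1": 10.0,
     "q10_s2": 7.0, "q90_s2": 13.0, "q50_s2": 10.5},
    {"q10_s1": 11.0, "q90_s1": 19.0, "q50_s1": 15.0,
     "q10_s2": 10.0, "q90_s2": 20.0, "q50_s2": 15.5},
]

summary = horizon_summary(
    rows,
    qlow_cols=["q10_s1", "q10_s2"],
    qup_cols=["q90_s1", "q90_s2"],
    q50_cols=["q50_s1", "q50_s2"],
)
for label, (width, median) in zip(["H+1", "H+2"], summary):
    print(f"{label}: bar height = {width}, bar color value = {median}")
```

The first value of each pair drives the bar height and the second drives the bar color.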

With this in mind, let’s apply the plot to a classic forecasting problem.


Use Case 1: Standard Forecast Horizon Analysis

The most common use of this plot is to see how a model’s uncertainty and central prediction change as it forecasts further into the future. It’s a typical and expected behavior for uncertainty to grow over longer lead times, and this plot quantifies that drift.

Let’s simulate a multi-step forecast where both the predicted value and its uncertainty increase for each step.

import kdiagram as kd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation: Multi-step Forecast ---
# Each row represents a forecast horizon (H+1 to H+6)
# Each column is a different sample (s1, s2) of one quantile
np.random.seed(0)  # fixed seed so the figure is reproducible
horizons = ["H+1", "H+2", "H+3", "H+4", "H+5", "H+6"]
df_horizons = pd.DataFrame(index=horizons)

for i, h in enumerate(horizons):
    # Both median and width increase with the horizon
    median = 10 + 5 * i
    width = 5 + 3 * i
    # Create two samples of each quantile for this horizon
    for s in (1, 2):
        df_horizons.loc[h, f'q10_s{s}'] = median - width/2 + np.random.randn()
        df_horizons.loc[h, f'q90_s{s}'] = median + width/2 + np.random.randn()
        df_horizons.loc[h, f'q50_s{s}'] = median + np.random.randn()

# --- 2. Plotting ---
kd.plot_horizon_metrics(
    df=df_horizons,
    qlow_cols=['q10_s1', 'q10_s2'],
    qup_cols=['q90_s1', 'q90_s2'],
    q50_cols=['q50_s1', 'q50_s2'],
    title="Use Case 1: Mean Interval Width Across Horizons",
    xtick_labels=horizons,
    r_label="Mean Interval Width",
    cbar_label="Mean Q50 Value",
    savefig="gallery/images/gallery_horizon_metrics_basic.png"
)
plt.close()

A polar bar chart where both the height of the bars (uncertainty) and their color (median prediction) increase progressively across the forecast horizons.

🧠 Analysis and Interpretation

This plot provides a two-dimensional summary of the forecast’s drift. The bar height (radius) clearly increases as we move clockwise from horizon “H+1” to “H+6”. This is a direct visualization of growing uncertainty; the model’s average prediction interval width gets larger as it forecasts further into the future. Simultaneously, the color of the bars shifts from blue (lower values) to red (higher values), showing that the model’s central prediction (the mean Q50 value) is also trending upwards across the horizons.
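
To make the two encodings concrete, the quantities behind the bars can be reproduced by hand. The sketch below assumes, based on the plot description only, that the radius is the row-wise mean of (upper quantile minus lower quantile) and that the color is the row-wise mean of the Q50 columns. The internals of plot_horizon_metrics() are not shown in this gallery and may differ, for example in normalization. Noise is left out so the numbers are easy to check.

```python
import pandas as pd

horizons = ["H+1", "H+2", "H+3", "H+4", "H+5", "H+6"]

# Noise-free version of the simulated forecast: two identical samples per horizon.
rows = {}
for i, h in enumerate(horizons):
    median = 10 + 5 * i
    width = 5 + 3 * i
    rows[h] = {
        "q10_s1": median - width / 2, "q90_s1": median + width / 2, "q50_s1": median,
        "q10_s2": median - width / 2, "q90_s2": median + width / 2, "q50_s2": median,
    }
df = pd.DataFrame.from_dict(rows, orient="index")

qlow_cols = ["q10_s1", "q10_s2"]
qup_cols = ["q90_s1", "q90_s2"]
q50_cols = ["q50_s1", "q50_s2"]

# Assumed radius: mean interval width per horizon (averaged over the samples).
widths = df[qup_cols].to_numpy() - df[qlow_cols].to_numpy()
mean_width = pd.Series(widths.mean(axis=1), index=df.index)

# Assumed color value: mean Q50 per horizon.
mean_q50 = df[q50_cols].mean(axis=1)

print(pd.DataFrame({"mean_width": mean_width, "mean_q50": mean_q50}))
```

Under these assumptions the bar heights step up by 3 units per horizon and the color values by 5, which is the joint drift described above.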

See Also

This plot is closely related to plot_model_drift(). While both visualize drift over horizons with polar bars, this function is more general-purpose. It can be used to compare any set of distinct categories (not just time horizons) and offers more direct control over the data columns used for the radius and color calculations.


For a deeper understanding of the statistical concepts behind analyzing forecasts over different horizons, please refer back to the main Comparing Metrics Across Horizons (plot_horizon_metrics()) section.

Combined Analysis: Reliability and Horizon Drift

Evaluating a sophisticated forecasting model often requires more than a single plot. A comprehensive analysis uses several complementary visualizations to diagnose different aspects of performance. This tutorial walks through a workflow that combines plot_polar_reliability() and plot_horizon_metrics() to perform a two-part evaluation of a weather forecast.

First, let’s re-introduce the anatomy of the two plots we will be using in our combined analysis.

Plot Anatomy (Polar Reliability)

  • Angle (θ): Represents the mean predicted probability of an event (e.g., rain), sweeping from 0.0 to 1.0.

  • Radius (r): Represents the observed frequency of that event.

  • Reference: The dashed black spiral is the line of perfect calibration. A good model’s curve should follow this spiral.
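
The pairing of angle and radius can be illustrated with a small reliability computation. This is a minimal sketch, assuming equal-width probability bins; the helper name `reliability_curve` is hypothetical, and the binning strategy used by plot_polar_reliability() may differ.

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """Per non-empty bin: mean predicted probability, observed frequency, count."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields a bin index in 0..n_bins-1,
    # so a forecast of exactly 1.0 falls into the last bin.
    idx = np.digitize(y_prob, edges[1:-1])
    mean_pred, obs_freq, counts = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())   # would map to the angle
            obs_freq.append(y_true[mask].mean())    # would map to the radius
            counts.append(int(mask.sum()))
    return np.array(mean_pred), np.array(obs_freq), np.array(counts)

# Tiny hand-checkable example: four forecasts of 0.1 and four of 0.9.
y_prob = np.array([0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
mean_pred, obs_freq, counts = reliability_curve(y_true, y_prob, n_bins=5)
print(mean_pred, obs_freq, counts)
```

Here the 0.1 forecasts verify 25% of the time and the 0.9 forecasts 75% of the time, so both points sit off the line of perfect calibration, where observed frequency equals predicted probability.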

Plot Anatomy (Horizon Metrics)

  • Angle (θ): Represents distinct forecast horizons (e.g., “H+6”, “H+12”).

  • Radius (r): The height of each bar represents the average prediction interval width (uncertainty).

  • Color: The color of each bar represents the average median (Q50) prediction (e.g., the expected amount of rain).
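
The three encodings listed above can be mimicked with plain Matplotlib. This is only an illustrative sketch, not the implementation used by plot_horizon_metrics(): the values are made up, and the evenly spaced angles, the `coolwarm` colormap, and the bar width are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

labels = ["H+6", "H+12", "H+18", "H+24"]
mean_width = np.array([3.0, 7.0, 11.0, 15.0])  # radius (illustrative values)
mean_q50 = np.array([5.0, 10.0, 15.0, 20.0])   # color (illustrative values)

# Angle: one evenly spaced slot per horizon.
theta = np.linspace(0.0, 2.0 * np.pi, len(labels), endpoint=False)
bar_width = 2.0 * np.pi / len(labels) * 0.9

# Color: map the mean Q50 values onto a blue-to-red colormap.
norm = Normalize(vmin=mean_q50.min(), vmax=mean_q50.max())
cmap = plt.get_cmap("coolwarm")

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
bars = ax.bar(theta, mean_width, width=bar_width, color=cmap(norm(mean_q50)))
ax.set_xticks(theta)
ax.set_xticklabels(labels)

# Colorbar tied to the same normalization as the bar colors.
fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="Mean Q50")
plt.close(fig)
```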

Now, let’s apply these two diagnostics to a challenging, real-world forecasting problem.


Use Case: A Holistic Evaluation of a Weather Forecast Model

A meteorological agency has a new weather model that produces two key outputs for a 24-hour period:

  1. The probability that it will rain at all (a binary event).

  2. A probabilistic forecast of the total rainfall amount (in mm).

To validate this new model, we need to answer two critical questions:

  • Is the model reliable? When it predicts a 70% chance of rain, does it actually rain on about 70% of those occasions?

  • How does its uncertainty grow over time? Is the forecast for rainfall amount sharp and useful for the next 6 hours, but too uncertain for the full 24-hour period?

We will perform a combined analysis by creating a side-by-side plot to answer both questions at once.

Practical Example

import kdiagram as kd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Data Generation ---
np.random.seed(1)
n_days = 1000

# --- Part A: Data for Reliability Plot (Probability of Rain) ---
# True events: it rains on about 40% of days
y_true_rain_event = (np.random.rand(n_days) < 0.4).astype(int)
# Simulated forecast probabilities: noise around the 40% base rate, drawn
# independently of the outcomes, so forecasts far from 0.4 are over-confident
y_pred_rain_prob = np.clip(0.4 + np.random.normal(0, 0.3, n_days), 0, 1)

# --- Part B: Data for Horizon Metrics Plot (Amount of Rain) ---
horizons = ["H+6", "H+12", "H+18", "H+24"]
df_horizons = pd.DataFrame(index=horizons)

for i, h in enumerate(horizons):
    # Both median rainfall and uncertainty increase with the horizon
    median = 5 + 5 * i
    width = 3 + 4 * i
    # Create two samples (s1, s2) of each quantile for this horizon
    for s in (1, 2):
        df_horizons.loc[h, f'q10_s{s}'] = median - width/2 + np.random.randn()
        df_horizons.loc[h, f'q90_s{s}'] = median + width/2 + np.random.randn()
        df_horizons.loc[h, f'q50_s{s}'] = median + np.random.randn()

# --- 2. Create a figure with two polar subplots ---
fig = plt.figure(figsize=(18, 9))
ax1 = fig.add_subplot(1, 2, 1, projection='polar')
ax2 = fig.add_subplot(1, 2, 2, projection='polar')

# --- 3. Plot each diagnostic on its dedicated axis ---
kd.plot_polar_reliability(
    y_true_rain_event, y_pred_rain_prob,
    ax=ax1,
    names=["Forecast Model"],
    title='Part A: Is the Rain Probability Forecast Reliable?'
)
kd.plot_horizon_metrics(
    df=df_horizons,
    ax=ax2,
    qlow_cols=['q10_s1', 'q10_s2'],
    qup_cols=['q90_s1', 'q90_s2'],
    q50_cols=['q50_s1', 'q50_s2'],
    xtick_labels=horizons,
    title='Part B: How Does Rainfall Uncertainty Evolve?',
    r_label="Mean Interval Width (mm)",
    cbar_label="Mean Predicted Rainfall (mm)"
)

fig.suptitle('Combined Analysis of a Weather Forecast Model', fontsize=18)
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
fig.savefig("gallery/images/gallery_comparison_combined.png")
plt.close(fig)

A two-panel figure providing a complete model evaluation. The left plot diagnoses the calibration of the rain probability forecast, while the right plot shows how the uncertainty of the rainfall amount forecast grows over time.

🧠 Analysis and Interpretation

This combined view provides a performance summary that neither plot could offer on its own.

The Reliability Spiral on the left diagnoses the model’s ability to predict if it will rain. The model’s curve falls slightly inside the dashed reference spiral, particularly for higher probabilities. This indicates the model is slightly over-confident: when it predicts a high probability of rain, the actual frequency is a bit lower.
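
The over-confidence at high probabilities can also be checked numerically. The sketch below regenerates the same simulated event data and compares the mean forecast with the observed rain frequency on days where the forecast was at least 0.7. The 0.7 threshold is an arbitrary choice for illustration. Because the simulated probabilities are drawn independently of the outcomes, the observed frequency in this subset should sit near the 40% base rate, below the mean forecast.

```python
import numpy as np

# Same simulated event data as in the practical example (same seed and call order).
np.random.seed(1)
n_days = 1000
y_true = (np.random.rand(n_days) < 0.4).astype(int)
y_prob = np.clip(0.4 + np.random.normal(0, 0.3, n_days), 0, 1)

# Days with a confident "rain" forecast (threshold chosen for illustration).
high = y_prob >= 0.7
mean_pred_high = y_prob[high].mean()
obs_freq_high = y_true[high].mean()

# Positive gap: forecasts are higher than what was observed (over-confidence).
gap = mean_pred_high - obs_freq_high

print(f"days={int(high.sum())}, mean forecast={mean_pred_high:.2f}, "
      f"observed frequency={obs_freq_high:.2f}, gap={gap:.2f}")
```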

The Horizon Metrics plot on the right shows a clear drift in the forecast for rainfall amount. The height of the bars (mean interval width) increases steadily from the 6-hour to the 24-hour forecast, indicating that the model’s uncertainty grows significantly over longer lead times. The color also shifts from blue to red, showing that the median predicted rainfall amount also increases.

Overall Conclusion: By combining these two plots, we can conclude that while the model is slightly over-confident in predicting if it will rain, its primary weakness is a rapid degradation in the precision of its forecast for how much it will rain at longer lead times. This is a critical insight for anyone using this model for operational planning.
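
How fast the precision degrades can be read directly from the simulation settings. The short calculation below uses the noise-free median and width formulas from the data-generation step; the plotted values differ slightly because of the added random noise. Dividing the width by the median to get a relative measure is an extra step introduced here for illustration, not something the plot itself shows.

```python
import numpy as np

horizons = ["H+6", "H+12", "H+18", "H+24"]
i = np.arange(len(horizons))
median = 5 + 5 * i   # noise-free median rainfall (mm) from the simulation
width = 3 + 4 * i    # noise-free Q10-Q90 interval width (mm)

growth = width / width[0]   # width relative to the first horizon
relative = width / median   # width as a fraction of the median

for h, w, g, r in zip(horizons, width, growth, relative):
    print(f"{h:>5}: width={int(w):2d} mm, x{g:.2f} vs H+6, width/median={r:.2f}")
```

In absolute terms the interval widens from 3 mm to 15 mm across the four horizons, while relative to the predicted amount it rises from 0.60 to 0.75. Which of the two views matters more depends on how the forecast is used.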


For a deeper understanding of the statistical concepts behind these evaluation techniques, please refer back to the main Model Comparison Visualization and Evaluating Probabilistic Forecasts sections.

Copyright © 2026, Laurent Kouadio