Model Comparison Visualization
Comparing the performance of different forecasting or simulation models is a common task in model development and selection. Often, evaluation requires looking at multiple performance metrics simultaneously to understand the trade-offs and overall suitability of each model for a specific application.
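The kdiagram.plot.comparison module provides tools specifically
for this purpose: radar charts for multi-metric, multi-model comparisons,
reliability diagrams for probability calibration, and polar bar charts
for comparing metrics across horizons or other categories.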
Summary of Comparison Functions
| Function | Description |
|---|---|
| plot_model_comparison() | Generates a radar chart comparing multiple models across various performance metrics (e.g., R2, MAE, Accuracy). |
| plot_reliability_diagram() | Draws a reliability (calibration) diagram to assess how well predicted probabilities match observed frequencies. |
| plot_horizon_metrics() | Draws a polar bar chart to visually compare key metrics across a set of distinct categories. |
Detailed Explanations
Let's explore each of these comparison functions in turn.
Multi-Metric Model Comparison (plot_model_comparison())
Purpose: This function generates a radar chart (also known as a spider or star chart) to visually compare the performance of multiple models across multiple evaluation metrics simultaneously. It provides a holistic snapshot of model strengths and weaknesses, making it easier to select the best model based on criteria beyond a single score. Optionally, training time can be included as an additional comparison axis.
Mathematical Concept:
For each model \(k\) (with predictions \(\hat{y}_k\)) and each chosen metric \(m\), a score \(S_{m,k}\) is calculated using the true values \(y_{true}\):
\[S_{m,k} = \mathrm{Metric}_m\big(y_{true}, \hat{y}_k\big)\]
The metrics used can be standard ones (like R2, MAE, Accuracy, F1) or custom functions. If train_times are provided, they are treated as another dimension.
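The scores for each metric \(m\) are typically scaled across the models (using scale='norm' for Min-Max or scale='std' for Standard Scaling) before plotting, to bring potentially different metric ranges onto a comparable radial axis:
\[S'_{m,k} = \frac{S_{m,k} - \min_{k'} S_{m,k'}}{\max_{k'} S_{m,k'} - \min_{k'} S_{m,k'}} \quad (\text{Min-Max}), \qquad S'_{m,k} = \frac{S_{m,k} - \mu_m}{\sigma_m} \quad (\text{Standard}),\]
where \(\mu_m\) and \(\sigma_m\) are the mean and standard deviation of the scores for metric \(m\) across the models.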
Each metric \(m\) is assigned an angle \(\theta_m\) on the radar chart, and the scaled score \(S'_{m,k}\) determines the radial distance along that axis for model \(k\). These points are connected to form a polygon representing each model’s overall performance profile.
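The scoring, scaling, and angle assignment described above can be sketched in plain Python. This is an illustrative sketch only, not the kdiagram implementation: the helper names (`r2`, `mae`, `minmax_scale`), the sample data, and the choice to invert lower-is-better metrics after Min-Max scaling are assumptions made for the example.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def minmax_scale(scores, lower_is_better=False):
    """Min-Max scale one metric across models; optionally invert (assumed behaviour)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all models tie on this metric
        return [1.0] * len(scores)
    scaled = [(s - lo) / (hi - lo) for s in scores]
    return [1.0 - s for s in scaled] if lower_is_better else scaled

# Hypothetical data: three models, plus training times in seconds.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_preds = {
    "Model A": [1.1, 1.9, 3.2, 3.8, 5.1],
    "Model B": [1.5, 2.5, 2.5, 4.5, 4.5],
    "Model C": [0.0, 2.0, 4.0, 4.0, 6.0],
}
train_times = {"Model A": 2.0, "Model B": 0.5, "Model C": 1.0}
names = list(y_preds)

# Raw scores S[m][k] for each metric m and model k.
raw = {
    "r2": [r2(y_true, y_preds[n]) for n in names],
    "mae": [mae(y_true, y_preds[n]) for n in names],
    "train_time_s": [train_times[n] for n in names],
}
lower_is_better = {"r2": False, "mae": True, "train_time_s": True}

# Scaled scores S'[m][k] and one angle theta_m per metric.
scaled = {m: minmax_scale(v, lower_is_better[m]) for m, v in raw.items()}
angles = [2 * math.pi * i / len(raw) for i in range(len(raw))]

# Each model's polygon: one radius per metric axis.
profiles = {n: [scaled[m][k] for m in raw] for k, n in enumerate(names)}
for name, radii in profiles.items():
    print(name, [round(r, 3) for r in radii])
```

In this toy data, Model A is the most accurate but the slowest to train, so its polygon reaches the outer edge on the accuracy axes and collapses to the center on the time axis, which is the kind of trade-off the radar chart is meant to expose.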
Interpretation:
Axes: Each axis radiating from the center represents a different performance metric (e.g., 'r2', 'mae', 'accuracy', 'train_time_s').
Polygons: Each colored polygon corresponds to a different model, as indicated by the legend.
Radius: The distance from the center along a metric’s axis shows the model’s (potentially scaled) score for that metric.
Important: By default (scale='norm' with internal inversion for error metrics), a larger radius generally indicates better performance: a higher score for accuracy/R2, and a lower score for MAE/RMSE/MAPE/time, which are inverted during scaling. Check the scale parameter used. If scale=None, interpret the radius based on the raw metric values.
Shape Comparison: Compare the overall shapes and sizes of the polygons. A model with a consistently large polygon across multiple desirable metrics might be considered the best overall performer. Different shapes highlight trade-offs (e.g., one model might excel in R2 but be slow, while another is fast but has lower R2).
Use Cases:
Multi-Objective Model Selection: Choose the best model when performance needs to be balanced across several, potentially conflicting, metrics (e.g., high accuracy vs. low error vs. fast training time).
Visualizing Strengths/Weaknesses: Quickly identify which metrics a particular model excels at or struggles with compared to others.
Communicating Comparative Performance: Provide stakeholders with an intuitive visual summary of how different candidate models stack up against each other based on chosen criteria.
Comparing Regression and Classification: Use appropriate default or custom metrics to compare models for either task type.
Advantages (Radar Context):
Effectively displays multiple performance dimensions (>2) for multiple entities (models) in a single, relatively compact plot.
Allows direct comparison of the profiles of different models – are they generally good/bad, or strong in some areas and weak in others?
Facilitates the identification of trade-offs between different metrics.
Example: (See the Model Comparison Example in the Gallery)
Reliability Diagram (plot_reliability_diagram())
Purpose: This function draws a reliability (calibration) diagram, a standard method in forecast verification [1], to assess how well predicted probabilities match observed frequencies. It supports one or many models on the same figure, multiple binning strategies, optional error bars (e.g., Wilson intervals), and a counts panel for diagnosing data sparsity across probability ranges.
Mathematical Concept: Given binary labels \(y_j \in \{0,1\}\) and predicted probabilities \(p_j \in [0,1]\) (optionally with per-sample weights \(w_j \ge 0\)), probabilities are partitioned into bins via a binning rule \(b(\cdot)\) (uniform or quantile).
For bin \(i\), define the (weighted) bin weight
\[W_i = \sum_{j:\, b(p_j) = i} w_j,\]
with total weight \(W = \sum_i W_i = \sum_{j=1}^{N} w_j\).
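Within each bin, compute the mean confidence (x-axis) and observed frequency (y-axis) as weighted averages over the samples assigned to that bin:
\[\mathrm{conf}_i = \frac{\sum_{j:\, b(p_j) = i} w_j \, p_j}{\sum_{j:\, b(p_j) = i} w_j}, \qquad \mathrm{acc}_i = \frac{\sum_{j:\, b(p_j) = i} w_j \, y_j}{\sum_{j:\, b(p_j) = i} w_j}.\]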
Each bin yields a point \((\mathrm{conf}_i, \mathrm{acc}_i)\). A perfectly calibrated model satisfies \(\mathrm{acc}_i \approx \mathrm{conf}_i\) for all bins, i.e., points lie on the diagonal \(y=x\).
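As a rough illustration of the per-bin statistics, the sketch below computes the weighted bin weight, mean confidence, and observed frequency for uniform bins in plain Python. The function name `reliability_bins`, its return layout, and the toy data are assumptions for this example and do not reflect the library's internals.

```python
def reliability_bins(y, p, w=None, n_bins=10):
    """Return (W_i, conf_i, acc_i) for each non-empty uniform bin on [0, 1]."""
    if w is None:
        w = [1.0] * len(y)  # unweighted case: every sample counts once
    # Per bin: [sum of weights, sum of w*p, sum of w*y]
    sums = [[0.0, 0.0, 0.0] for _ in range(n_bins)]
    for yj, pj, wj in zip(y, p, w):
        i = min(int(pj * n_bins), n_bins - 1)  # p = 1.0 goes to the last bin
        sums[i][0] += wj
        sums[i][1] += wj * pj
        sums[i][2] += wj * yj
    return [(W, sp / W, sy / W) for W, sp, sy in sums if W > 0]

# Toy data: binary labels and predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

bins = reliability_bins(y, p, n_bins=2)
for W, conf, acc in bins:
    print(f"W={W:.0f}  conf={conf:.2f}  acc={acc:.2f}")
```

Here the lower bin sits on the diagonal (confidence and observed frequency are both 0.25), while the upper bin has observed frequency 1.0 against a mean confidence of 0.75, so its point lies above the diagonal.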
Uncertainty in observed frequency. When \(W_i\) is sufficiently large, a normal approximation can be used for \(\mathrm{acc}_i\) with standard error
\[\mathrm{SE}_i = \sqrt{\frac{\mathrm{acc}_i \,(1 - \mathrm{acc}_i)}{W_i}}.\]
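Alternatively, the Wilson interval (95%) for a binomial proportion with \(z = 1.96\) provides a more stable interval, especially for small counts. Writing \(\hat{p} = \mathrm{acc}_i\) and \(n = W_i\):
\[\frac{\hat{p} + \dfrac{z^2}{2n}}{1 + \dfrac{z^2}{n}} \;\pm\; \frac{z}{1 + \dfrac{z^2}{n}} \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}.\]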
(With sample weights, \(n\) is treated as an effective count.)
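To make the difference between the two interval types concrete, here is a small sketch using the standard textbook formulas for the normal-approximation and Wilson score intervals. The function names are illustrative and are not taken from the library.

```python
import math

def normal_interval(acc, n, z=1.96):
    """Normal-approximation interval: acc +/- z * sqrt(acc * (1 - acc) / n)."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * se, acc + z * se

def wilson_interval(acc, n, z=1.96):
    """Wilson score interval for a binomial proportion (n may be an effective count)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (acc + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(acc * (1.0 - acc) / n + z * z / (4.0 * n * n))
    return center - half, center + half

# A sparse bin where every one of 5 samples was positive.
print(normal_interval(1.0, 5))   # collapses to a zero-width interval
print(wilson_interval(1.0, 5))   # still reports meaningful uncertainty

# A well-populated bin: the two intervals nearly agree.
print(normal_interval(0.5, 100))
print(wilson_interval(0.5, 100))
```

The sparse-bin case shows why the Wilson form is often preferred for reliability diagrams: the normal approximation reports no uncertainty at all when the observed frequency is exactly 0 or 1.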
Aggregate calibration metrics.
Expected Calibration Error (ECE) (L1 form):
\[\mathrm{ECE} \;=\; \sum_{i} \frac{W_i}{W} \;\big|\mathrm{acc}_i - \mathrm{conf}_i\big|. \tag{8}\]
Maximum Calibration Error (MCE) (optional concept):
\[\mathrm{MCE} \;=\; \max_i \;\big|\mathrm{acc}_i - \mathrm{conf}_i\big|. \tag{9}\]
Brier score (mean squared error on probabilities):
\[\mathrm{Brier} \;=\; \frac{1}{W}\sum_{j=1}^{N} w_j \, (p_j - y_j)^2. \tag{10}\]
Lower ECE/MCE/Brier indicate better calibration (and accuracy for Brier).
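The three aggregate metrics can be computed directly from their definitions. The sketch below assumes per-bin statistics are available as `(W_i, conf_i, acc_i)` tuples; that layout, the function names, and the toy values are hypothetical and chosen only for illustration.

```python
def ece_mce(bins):
    """ECE and MCE from an iterable of (W_i, conf_i, acc_i) tuples."""
    bins = list(bins)
    total = sum(W for W, _, _ in bins)          # W = sum of bin weights
    gaps = [(W, abs(acc - conf)) for W, conf, acc in bins]
    ece = sum(W / total * gap for W, gap in gaps)
    mce = max(gap for _, gap in gaps)
    return ece, mce

def brier_score(y, p, w=None):
    """Weighted mean squared error between probabilities and binary labels."""
    if w is None:
        w = [1.0] * len(y)
    total = sum(w)
    return sum(wj * (pj - yj) ** 2 for yj, pj, wj in zip(y, p, w)) / total

# Two bins of four samples each; only the second is miscalibrated.
bins = [(4.0, 0.25, 0.25), (4.0, 0.75, 1.0)]
ece, mce = ece_mce(bins)

y = [0, 0, 1, 0, 1, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(f"ECE={ece:.3f}  MCE={mce:.3f}  Brier={brier_score(y, p):.3f}")
```

ECE averages the calibration gap weighted by bin mass, whereas MCE reports only the worst bin, so a single sparse, badly calibrated bin can dominate MCE while barely moving ECE.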
Interpretation:
Diagonal (\(y=x\)): Reference for perfect calibration.
Points above diagonal \((\mathrm{acc}_i > \mathrm{conf}_i)\) ⇒ model is under-confident in that bin.
Points below diagonal \((\mathrm{acc}_i < \mathrm{conf}_i)\) ⇒ model is over-confident in that bin.
Counts panel: A histogram of \(p_j\) per bin reveals data coverage; sparse bins tend to have larger uncertainty intervals.
Multiple models: Curves are overlaid; compare proximity to the diagonal and reported ECE/Brier in the legend.
Binning strategies:
Uniform: fixed-width bins on \([0,1]\) (e.g., 10 bins).
Quantile: bins formed so each has (approximately) equal counts. This stabilizes variance of \(\mathrm{acc}_i\) but can yield irregular edges if many identical scores occur.
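The difference between the two strategies shows up in the bin edges. The following sketch is a simplified illustration: it takes quantile edges directly from order statistics without interpolation and merges duplicate edges, which is an assumption of this example rather than a description of how the library bins data.

```python
def uniform_edges(n_bins):
    """Fixed-width bin edges on [0, 1]."""
    return [k / n_bins for k in range(n_bins + 1)]

def quantile_edges(p, n_bins):
    """Approximate equal-count edges from sorted scores; duplicate edges are merged."""
    s = sorted(p)
    last = len(s) - 1
    raw = [s[(k * last) // n_bins] for k in range(n_bins + 1)]
    edges = []
    for e in raw:
        if not edges or e > edges[-1]:  # drop repeated edges caused by tied scores
            edges.append(e)
    return edges

# Many identical scores: five of eight predictions are exactly 0.1.
p_tied = [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.6, 0.9]
print(uniform_edges(4))            # [0.0, 0.25, 0.5, 0.75, 1.0]
print(quantile_edges(p_tied, 4))   # [0.1, 0.3, 0.9]: only two usable bins

# Evenly spread scores: quantile edges coincide with uniform ones.
p_even = [i / 8 for i in range(9)]
print(quantile_edges(p_even, 4))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

With heavily tied scores, four requested quantile bins collapse to two, which is the irregular-edge behaviour noted above.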
Use Cases:
Calibrating classifiers that output probabilities (logistic regression, gradient boosting, neural nets).
Comparing models or calibration methods (e.g., Platt scaling vs. isotonic regression).
Communicating reliability: the diagram shows at a glance if a model is systematically over-/under-confident and where.
Advantages:
Local view of calibration (per bin) instead of a single scalar.
Uncertainty-aware via bin-wise intervals.
Distribution-aware with the counts panel, showing score sharpness and data coverage.
Example: (See the Gallery example for a complete, runnable snippet that saves an image and returns per-bin statistics.)
Comparing Metrics Across Horizons (plot_horizon_metrics())
Purpose: This function creates a polar bar chart, a novel visualization developed as part of the analytics framework in Kouadio et al.[2], to visually compare key metrics across a set of distinct categories, most commonly different forecast horizons (e.g., H+1, H+2, etc.). It is designed to answer questions like: “How does my model’s uncertainty (interval width) and central tendency (median prediction) evolve as it forecasts further into the future?”
Mathematical Concept: The plot summarizes metrics for \(N\) horizons (corresponding to the rows in the input df) using data from \(M\) samples (corresponding to the provided columns for each quantile). Let the input data be represented by matrices for the lower, upper, and median quantiles: \(\mathbf{L}\), \(\mathbf{U}\), and \(\mathbf{Q50}\), all of shape \((N, M)\).
Interval Width Calculation: First, a matrix of interval widths \(\mathbf{W}\) of shape \((N, M)\) is computed by element-wise subtraction. Each element \(W_{j,i}\) represents the interval width for horizon \(j\) and sample \(i\).
\[W_{j,i} = U_{j,i} - L_{j,i} \tag{11}\]
Radial Value (Bar Height): The primary metric plotted as the bar height (radial value \(r_j\)) for each horizon \(j\) is the mean of its interval widths across all \(M\) samples.
\[r_j = \frac{1}{M} \sum_{i=0}^{M-1} W_{j,i} \tag{12}\]
If normalize_radius=True, these values are then min-max scaled to the range [0, 1].
Color Value: The secondary metric, encoded as color, is the mean of the Q50 values for each horizon \(j\).
\[c_j = \frac{1}{M} \sum_{i=0}^{M-1} Q50_{j,i} \tag{13}\]
If q50_cols are not provided, the color value defaults to the radial value, \(c_j = r_j\). These color values are then mapped to a colormap via a standard normalization.
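The width, radius, and color calculations can be traced with a few lines of plain Python. In this sketch the function name `horizon_metrics`, the nested-list inputs, and the decision to take the fallback color from the unnormalized mean width are assumptions for illustration; the library itself works from DataFrame columns.

```python
def horizon_metrics(lower, upper, q50=None, normalize_radius=False):
    """Return (r, c): mean interval width and mean Q50 per horizon (row)."""
    # W[j][i] = U[j][i] - L[j][i] for horizon j and sample i
    widths = [[u - l for l, u in zip(l_row, u_row)]
              for l_row, u_row in zip(lower, upper)]
    r = [sum(row) / len(row) for row in widths]
    if q50 is not None:
        c = [sum(row) / len(row) for row in q50]
    else:
        c = list(r)  # assumed: fall back to the raw (unnormalized) mean width
    if normalize_radius:
        lo, hi = min(r), max(r)
        r = [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in r]
    return r, c

# Three horizons (rows), two samples each (columns); values are made up.
lower = [[1.0, 2.0], [1.0, 1.0], [0.0, 1.0]]
upper = [[3.0, 4.0], [4.0, 6.0], [6.0, 9.0]]
q50 = [[2.0, 3.0], [2.5, 3.5], [3.0, 5.0]]

r, c = horizon_metrics(lower, upper, q50)
r_norm, _ = horizon_metrics(lower, upper, q50, normalize_radius=True)
print(r)       # mean widths per horizon: [2.0, 4.0, 7.0]
print(c)       # mean medians per horizon: [2.5, 3.0, 4.0]
print(r_norm)  # min-max scaled bar heights
```

In this toy example both the mean interval width and the mean median grow with the horizon, so the bars would get longer and shift along the colormap as the forecast reaches further ahead.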
Interpretation:
Angle: Each angular segment represents a different horizon or category, as specified by the xtick_labels parameter. The plot typically starts at the top (12 o'clock) and proceeds clockwise.
Radius (Bar Height): The length of each bar indicates the magnitude of the primary metric (e.g., mean interval width). Longer bars signify larger values.
Color: The color of each bar represents the magnitude of the secondary metric (e.g., mean Q50 value). The color bar on the side of the plot provides the scale for this metric.
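To picture the layout described above, the sketch below maps each horizon index to an angle starting at 12 o'clock and moving clockwise, then converts a bar's tip to Cartesian coordinates. The helper names are hypothetical, and angles follow the usual mathematical convention (degrees counter-clockwise from the positive x-axis), so "top" is 90 degrees. In matplotlib, a similar orientation is usually obtained with `set_theta_zero_location("N")` and `set_theta_direction(-1)` on a polar axes; whether the library does exactly this is an assumption.

```python
import math

def bar_angles(n, start_deg=90.0, clockwise=True):
    """Angle in degrees for each of n equally spaced bars."""
    step = 360.0 / n
    sign = -1.0 if clockwise else 1.0
    return [(start_deg + sign * j * step) % 360.0 for j in range(n)]

def bar_tip(radius, angle_deg):
    """Cartesian coordinates of a bar's outer end."""
    theta = math.radians(angle_deg)
    return radius * math.cos(theta), radius * math.sin(theta)

# Four horizons H+1..H+4 with hypothetical bar heights.
labels = ["H+1", "H+2", "H+3", "H+4"]
heights = [2.0, 4.0, 7.0, 9.0]
angles = bar_angles(len(labels))

for label, h, a in zip(labels, heights, angles):
    x, y = bar_tip(h, a)
    print(f"{label}: angle={a:.0f} deg, tip=({x:.2f}, {y:.2f})")
```

The first horizon points straight up, the second to the right, and so on around the circle, which is why trends across sequential horizons read naturally in clockwise order.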
Use Cases:
Analyzing Uncertainty Drift: Track how a model’s predictive uncertainty (interval width) grows or shrinks over a forecast horizon.
Comparing Forecast Magnitudes: Simultaneously visualize how the central tendency (Q50) of the forecast changes along with its uncertainty.
Comparing Models: Generate this plot for multiple models to compare their uncertainty profiles over time. A model with shorter, more stable bars may be preferable.
Categorical Performance: The “horizons” can represent any set of categories, such as different geographic regions or model configurations, to compare aggregated metrics.
Advantages (Polar Bar Context):
Intuitive Comparison: The circular layout allows for easy comparison of values across sequential categories.
Two-Dimensional Insight: It effectively encodes two different metrics (bar height and bar color) for each category in a single, compact plot.
Highlights Trends: Trends across horizons, such as consistently increasing uncertainty, are immediately apparent.
Example: (See the Horizon Metrics Example in the Gallery)
References