Mathematical Utilities

While the core of k-diagram is visualization, quantitative analysis is the foundation of any good forecast evaluation. The kdiagram.utils.mathext module provides a suite of mathematical and data extraction utilities designed to compute key performance metrics and prepare data for analysis.

These functions provide the numerical backbone for the diagnostic plots, allowing users to access the underlying scores for custom analysis, reporting, or integration into other workflows. They handle common tasks such as calculating proper scoring rules, assessing calibration, and extracting data from pandas DataFrames into a format suitable for numerical computation.

Summary of Mathematical Utility Functions

| Function | Description |
|---|---|
| get_forecast_arrays() | Extracts and validates true and predicted values from a DataFrame into NumPy arrays or pandas objects. |
| compute_coverage_score() | Calculates the empirical coverage of a prediction interval, with options to check for over- and under-prediction. |
| compute_winkler_score() | Computes the Winkler score, which evaluates both the sharpness and calibration of a prediction interval. |
| compute_pinball_loss() | Calculates the Pinball Loss for a single quantile forecast, the foundational metric for quantile evaluation. |
| compute_crps() | Approximates the Continuous Ranked Probability Score (CRPS) by averaging the Pinball Loss over all quantiles. |
| compute_pit() | Computes the Probability Integral Transform (PIT) value for each observation to assess calibration. |
| calculate_calibration_error() | Quantifies the overall calibration error using the Kolmogorov-Smirnov statistic on PIT values. |
| build_cdf_interpolator() | Creates a callable empirical CDF from a set of quantile forecasts. |
| minmax_scaler() | Scales features to a specified range, robust to zero-variance features. |


Extracting Forecast Arrays (get_forecast_arrays())

Purpose: This is a flexible extraction utility that serves as the primary bridge between a DataFrame-centric workflow and the NumPy-based mathematical and plotting functions in k-diagram. It streamlines that handoff by handling three tasks:

  • Selecting the correct columns for true values and predictions.

  • Cleaning the data by dropping or filling missing values.

  • Converting the output to the desired format (NumPy arrays or pandas objects).

Key Parameters Explained: While the function has many options, a few key parameters control its main behavior:

  • `return_as`: Determines the output type. Use 'numpy' (default) when you need raw arrays for mathematical computations. Use 'pandas' when you want to preserve the index and column names for further data manipulation.

  • `drop_na`: Controls how missing data is handled. By default, it removes any row where the actual_col or any of the pred_cols are NaN.

  • `squeeze`: When you request a single prediction column (pred_cols='column_name'), squeeze=True (default) returns a 1D array or Series. Set it to False to maintain a 2D column vector shape (n, 1), which is sometimes required for other libraries.
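The shape difference that `squeeze` controls can be illustrated with plain pandas and NumPy. This is a minimal sketch that does not call k-diagram; the DataFrame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column name is illustrative only.
df = pd.DataFrame({"pred_point": [12, 18, 33]})

# Selecting with a list keeps a 2D column vector, analogous to squeeze=False.
col_2d = df[["pred_point"]].to_numpy()

# Dropping the trailing axis gives a 1D array, analogous to squeeze=True.
col_1d = np.squeeze(col_2d, axis=1)

print(col_2d.shape)  # (3, 1)
print(col_1d.shape)  # (3,)
```

Libraries that expect a feature matrix usually want the (n, 1) form, while most metric computations are simpler with the 1D form.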

Conceptual Workflow: This function executes a sequence of data validation and transformation steps to ensure the output is clean and correctly formatted for downstream analysis.

  1. Column Selection: The function first identifies the full set of required columns based on the actual_col and pred_cols arguments and validates their existence in the input DataFrame.

  2. Data Subsetting and Cleaning:

    1. A subset of the DataFrame containing only the required columns is created.

    2. If fillna is specified, missing values are imputed using the provided strategy.

    3. If drop_na=True, rows with remaining missing values are dropped according to the na_policy ('any' or 'all').

  3. Type Coercion (Optional): If ensure_numeric=True, the function attempts to convert all selected columns to a numeric data type, either raising an error or coercing invalid values to NaN based on the coerce_numeric flag.

  4. Output Formatting: The cleaned and validated data is then converted to the desired output format specified by return_as ('numpy' or 'pandas'). If a single prediction column is requested and squeeze=True, the output is reduced to a 1D array or Series.

Mathematical Formulation: The function can be understood as a sequence of data transformation operations. Let \(\mathbf{DF}\) be the input DataFrame, \(c_a\) be the name of the actual column, and \(\mathbf{C}_p\) be the set of prediction column names. The process is as follows:

\[\begin{aligned} & \text{1. Subset:} & \mathbf{DF}_{sub} &\leftarrow \mathbf{DF}[\{c_a\} \cup \mathbf{C}_p] \\ & \text{2. Clean:} & \mathbf{DF}_{clean} &\leftarrow \mathcal{C}(\mathbf{DF}_{sub}, \text{policy}) \\ & \text{3. Extract:} & \mathbf{y}_{true} &\leftarrow \mathbf{DF}_{clean}[c_a] \\ & & \mathbf{Y}_{pred} &\leftarrow \mathbf{DF}_{clean}[\mathbf{C}_p] \\ & \text{4. Return:} & & (\mathbf{y}_{true}, \mathbf{Y}_{pred}) \end{aligned} \tag{1}\]

where:

  • \(\mathbf{DF}_{sub}\) is the subset of the original DataFrame containing only the columns of interest.

  • \(\mathcal{C}\) is a cleaning operator that applies the fillna and dropna policies to the subsetted data.

  • \(\mathbf{y}_{true}\) is the final vector of true values and \(\mathbf{Y}_{pred}\) is the final vector or matrix of predicted values, extracted from the cleaned DataFrame.
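The subset, clean, extract, and return steps can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's implementation: `extract_arrays` is a hypothetical helper, and it covers only the drop-rows cleaning policy.

```python
import numpy as np
import pandas as pd

def extract_arrays(df, actual_col, pred_cols, drop_na=True):
    """Illustrative subset -> clean -> extract pipeline (hypothetical helper)."""
    cols = [actual_col] + list(pred_cols)
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    sub = df[cols]                             # 1. subset
    if drop_na:
        sub = sub.dropna(how="any")            # 2. clean
    y_true = sub[actual_col].to_numpy()        # 3. extract
    y_pred = sub[list(pred_cols)].to_numpy()
    return y_true, y_pred                      # 4. return

df = pd.DataFrame({
    "actual": [10, 20, np.nan],
    "q10": [8, 15, 25],
    "q90": [12, 25, 35],
})
y_true, y_pred = extract_arrays(df, "actual", ["q10", "q90"])
print(y_true.shape, y_pred.shape)  # (2,) (2, 2)
```

Dropping rows on the combined subset, rather than on each column separately, keeps the true values and predictions aligned row by row.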

Examples:

Basic Extraction (NumPy Output): This example demonstrates the default behavior: extracting true values and a set of quantile predictions from a DataFrame that contains a missing value. The function automatically drops the row with the NaN before returning the clean NumPy arrays.

import pandas as pd
import numpy as np
import kdiagram.utils as kdu

# Create a sample DataFrame with a missing value
df = pd.DataFrame({
    'actual': [10, 20, 30, 40, np.nan],
    'pred_point': [12, 18, 33, 42, 48],
    'q10': [8, 15, 25, 35, 45],
    'q90': [12, 25, 35, 45, 55],
})

# Extract the actual values and the Q10/Q90 predictions
y_true, y_preds_q = kdu.get_forecast_arrays(
    df, actual_col='actual', pred_cols=['q10', 'q90']
)

print("--- True Values (NumPy) ---")
print(y_true)
print("\n--- Quantile Predictions (NumPy) ---")
print(y_preds_q)

Expected Output:
--- True Values (NumPy) ---
[10. 20. 30. 40.]

--- Quantile Predictions (NumPy) ---
[[ 8 12]
 [15 25]
 [25 35]
 [35 45]]

Pandas Output with Index: This example shows how to extract a single point prediction as a pandas Series, keeping the original index and without dropping missing values.

# Using the same DataFrame as above
y_preds_series = kdu.get_forecast_arrays(
    df,
    pred_cols='pred_point',
    return_as='pandas',
    drop_na=False
)

print("\n--- Point Predictions (pandas Series) ---")
print(y_preds_series)

Expected Output:
--- Point Predictions (pandas Series) ---
0    12
1    18
2    33
3    42
4    48
Name: pred_point, dtype: int64

Computing Coverage Scores (compute_coverage_score())

Purpose: This utility calculates the empirical coverage of a prediction interval. It is a fundamental metric for assessing the calibration of a forecast’s uncertainty bounds. A forecast is well-calibrated if its \((1-\alpha) \cdot 100\%\) prediction intervals contain the true observed value approximately \((1-\alpha) \cdot 100\%\) of the time.

The function is versatile, allowing you to calculate not just the standard coverage score (the proportion of true values within the interval), but also the proportion of values falling above or below the interval. This is crucial for diagnosing the direction of miscalibration.

Key Parameters Explained:

  • `method`: This parameter controls which type of coverage is calculated.

    • 'within': This is the standard coverage: the fraction of observations whose true value falls inside the interval, bounds included.

    • 'below': This calculates the fraction of times the true value was lower than your lower bound. A high value indicates your model’s intervals are systematically too high.

    • 'above': This calculates the fraction of times the true value was higher than your upper bound. A high value indicates your model’s intervals are systematically too low.

  • `return_counts`: By default, the function returns a proportion (a float between 0 and 1). Setting this to True returns the raw integer count, which can be useful for reports or further statistical tests.

Mathematical Concept: The empirical coverage is a key diagnostic for checking if a model’s prediction intervals are well-calibrated. For a given \((1-\alpha) \cdot 100\%\) prediction interval, the empirical coverage should be close to \(1-\alpha\).

The function calculates one of three scores for a set of \(N\) observations, where \(\mathbf{1}\) is the indicator function:

  1. Within-Interval Coverage (method='within'):

    \[\text{Coverage} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{y_{lower,i} \le y_{true,i} \le y_{upper,i}\} \tag{2}\]

  2. Below-Interval Rate (method='below'):

    \[\text{Rate}_{below} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{y_{true,i} < y_{lower,i}\} \tag{3}\]

  3. Above-Interval Rate (method='above'):

    \[\text{Rate}_{above} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{y_{true,i} > y_{upper,i}\} \tag{4}\]
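The three rates translate directly into boolean masks in NumPy. The sketch below is an illustration of the formulas, not the library's code; `coverage_rates` is a hypothetical helper and the data are made up.

```python
import numpy as np

def coverage_rates(y_true, y_lower, y_upper):
    """Return (within, below, above) rates; hypothetical helper."""
    y_true, y_lower, y_upper = map(np.asarray, (y_true, y_lower, y_upper))
    within = np.mean((y_lower <= y_true) & (y_true <= y_upper))
    below = np.mean(y_true < y_lower)
    above = np.mean(y_true > y_upper)
    return float(within), float(below), float(above)

# One value inside, one below, one above, one exactly on the lower bound
within, below, above = coverage_rates(
    y_true=[3, 7, 12, 5],
    y_lower=[2, 8, 9, 5],
    y_upper=[4, 10, 11, 6],
)
print(within, below, above)  # 0.5 0.25 0.25
```

Because the within-interval test is inclusive and the other two are strict, every observation lands in exactly one category and the three rates sum to 1.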

Examples:

Basic Usage: The following example demonstrates how to compute the standard coverage score, as well as the raw count of observations that fall below the specified interval.

import numpy as np
import kdiagram.utils as kdu

# Create sample data
y_true = np.array([1, 2, 3, 4, 5, 6])
y_lower = np.array([0, 3, 2, 5, 4, 7])
y_upper = np.array([2, 4, 4, 6, 6, 8])

# Calculate the standard coverage (3 out of 6 are within)
coverage = kdu.compute_coverage_score(y_true, y_lower, y_upper)
print(f"Coverage Score: {coverage:.2f}")

# Calculate the number of points below the interval
# (observations 2, 4, and 6 fall below their lower bounds)
count_below = kdu.compute_coverage_score(
    y_true, y_lower, y_upper, method='below', return_counts=True
)
print(f"Count below interval: {count_below}")

Expected Output:
Coverage Score: 0.50
Count below interval: 3

Diagnosing Miscalibration: A well-calibrated 80% prediction interval (e.g., from Q10 to Q90) should have approximately 10% of observations below the lower bound and 10% above the upper bound. We can use this function to check.

# Simulate a model whose intervals are systematically too low
np.random.seed(0)
y_true = np.random.normal(loc=10, scale=2, size=1000)

# The model issues the same interval for every observation.
# The observations are centered at 10, so [7, 11] sits too low.
y_lower_biased = np.full_like(y_true, 10 - 3)  # Lower bound is too low
y_upper_biased = np.full_like(y_true, 10 + 1)  # Upper bound is too low

# Calculate the rates
rate_within = kdu.compute_coverage_score(
    y_true, y_lower_biased, y_upper_biased, method='within'
)
rate_below = kdu.compute_coverage_score(
    y_true, y_lower_biased, y_upper_biased, method='below'
)
rate_above = kdu.compute_coverage_score(
    y_true, y_lower_biased, y_upper_biased, method='above'
)

print(f"Coverage (within interval): {rate_within:.2f}")
print(f"Rate below interval: {rate_below:.2f}")
print(f"Rate above interval: {rate_above:.2f}")

Expected Output (approximate; the exact digits depend on the random draw):
Coverage (within interval): 0.62
Rate below interval: 0.07
Rate above interval: 0.31

The output shows the miscalibration: roughly 31% of observations fall above the upper bound, against about 7% below the lower bound, confirming that the prediction intervals are biased low.


Computing the Winkler Score (compute_winkler_score())

Purpose: This utility calculates the Winkler score, a proper scoring rule designed specifically for evaluating prediction intervals. It rewards sharpness (narrow intervals) and heavily penalizes a lack of calibration (true values that fall outside the interval). A lower score is better.

Key Parameters Explained:

  • `alpha`: This is the significance level of the prediction interval. It determines how heavily the score penalizes observations that fall outside the bounds. For a 90% prediction interval (from Q5 to Q95), the alpha would be 0.1. For an 80% interval (Q10 to Q90), the alpha would be 0.2.
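The link between the interval's quantile levels and `alpha` is simple arithmetic, sketched below. `alpha_from_levels` is a hypothetical helper written for this illustration; it is not part of k-diagram.

```python
def alpha_from_levels(lower_level, upper_level):
    """Significance level of a central interval given its quantile levels."""
    if not 0.0 <= lower_level < upper_level <= 1.0:
        raise ValueError("Expected 0 <= lower_level < upper_level <= 1")
    return 1.0 - (upper_level - lower_level)

alpha_90 = alpha_from_levels(0.05, 0.95)  # Q5 to Q95
alpha_80 = alpha_from_levels(0.10, 0.90)  # Q10 to Q90

# Rounded to hide floating-point noise
print(round(alpha_90, 10), round(alpha_80, 10))  # 0.1 0.2
```

A smaller alpha means a larger multiplier 2/alpha on each unit of miss, so wider nominal intervals are punished more severely when they fail to cover.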

Mathematical Concept: The Winkler score [1] is designed to evaluate both the sharpness and calibration of a prediction interval simultaneously. The score for a single observation \(y\) and a \((1-\alpha)\) prediction interval \([l, u]\) is defined as:

\[S_{\alpha}(l, u, y) = (u - l) + \begin{cases} \frac{2}{\alpha}(l - y) & \text{if } y < l \\ 0 & \text{if } l \le y \le u \\ \frac{2}{\alpha}(y - u) & \text{if } y > u \end{cases} \tag{5}\]

The first term, \((u - l)\), is the interval width, which rewards sharpness (narrower intervals). The second term is a penalty that is applied only if the observation falls outside the interval. The penalty increases as the observation gets further from the violated bound. This function returns the average of this score over all observations.
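The width-plus-penalty structure can be written in a few lines of NumPy. This is a sketch of the formula rather than the library's implementation; `winkler_score` is a hypothetical helper and the data are made up.

```python
import numpy as np

def winkler_score(y_true, y_lower, y_upper, alpha):
    """Mean Winkler score: interval width plus scaled miss distance."""
    y, l, u = (np.asarray(a, dtype=float) for a in (y_true, y_lower, y_upper))
    width = u - l
    # np.clip(..., 0, None) zeroes the penalty when the bound is not violated.
    penalty_low = (2.0 / alpha) * np.clip(l - y, 0.0, None)
    penalty_high = (2.0 / alpha) * np.clip(y - u, 0.0, None)
    return float(np.mean(width + penalty_low + penalty_high))

# Obs 1 is covered (score = width = 4).
# Obs 2 misses the lower bound by 2 (score = 4 + (2 / 0.2) * 2 = 24).
score = winkler_score(y_true=[5, 0], y_lower=[4, 2], y_upper=[8, 6], alpha=0.2)
print(round(score, 6))  # 14.0
```

At most one of the two penalties is non-zero for any observation, so adding them reproduces the three-case definition.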

Example: The following example demonstrates how to calculate the Winkler score for a set of forecasts.

import numpy as np
import kdiagram.utils as kdu

# Create sample data
y_true = np.array([1, 5, 12])
y_lower = np.array([2, 4, 8])
y_upper = np.array([8, 6, 10])

# For a 90% interval, alpha = 0.1
# Obs 1 (y=1): outside. Width=6. Penalty=(2/0.1)*(2-1)=20. Score=26.
# Obs 2 (y=5): inside. Width=2. Penalty=0. Score=2.
# Obs 3 (y=12): outside. Width=2. Penalty=(2/0.1)*(12-10)=40. Score=42.
# Average = (26 + 2 + 42) / 3 = 23.33

score = kdu.compute_winkler_score(
    y_true, y_lower, y_upper, alpha=0.1
)
print(f"Average Winkler Score (alpha=0.1): {score:.2f}")

Expected Output:
Average Winkler Score (alpha=0.1): 23.33

Computing the Pinball Loss (compute_pinball_loss())

Purpose: This utility calculates the Pinball Loss, a fundamental metric used to evaluate the accuracy of a single quantile forecast. It is the building block for the Continuous Ranked Probability Score (CRPS). A lower score indicates a more accurate quantile forecast.

Mathematical Concept: The Pinball Loss, \(\mathcal{L}_{\tau}\), is a proper scoring rule for a single quantile forecast \(q\) at level \(\tau\) against an observation \(y\). Its key feature is that it asymmetrically penalizes errors. It gives a weight of \(\tau\) to under-predictions (when \(y > q\)) and a weight of \((1 - \tau)\) to over-predictions (when \(y < q\)).

\[\mathcal{L}_{\tau}(q, y) = \begin{cases} (y - q) \tau & \text{if } y \ge q \\ (q - y) (1 - \tau) & \text{if } y < q \end{cases} \tag{6}\]

This function calculates the average of this loss over all provided observations.
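The asymmetry is easy to see in code. The sketch below implements the formula with NumPy; `pinball_loss` is a hypothetical helper, not the library function.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau (hypothetical helper)."""
    y = np.asarray(y_true, dtype=float)
    q = np.asarray(y_pred, dtype=float)
    diff = y - q
    # diff >= 0: forecast too low, weight tau; diff < 0: forecast too high, weight 1 - tau
    loss = np.where(diff >= 0, tau * diff, (tau - 1.0) * diff)
    return float(np.mean(loss))

# The same 2-unit miss in opposite directions at tau = 0.9
under = pinball_loss([10.0], [8.0], 0.9)   # forecast below the observation
over = pinball_loss([10.0], [12.0], 0.9)   # forecast above the observation
print(round(under, 6), round(over, 6))  # 1.8 0.2
```

For a high quantile such as Q90, forecasting too low costs nine times as much as forecasting too high by the same amount, which is what pushes the optimal forecast toward the upper tail.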

Example: The following example demonstrates how to calculate the average Pinball Loss for a 90th percentile (Q90) forecast.

import numpy as np
import kdiagram.utils as kdu

# Create sample data
y_true = np.array([10, 10, 5])
y_pred_q90 = np.array([8, 12, 5]) # Under-predict, over-predict, exact
quantile = 0.9

# Loss for y=10, q=8: (10-8) * 0.9 = 1.8
# Loss for y=10, q=12: (12-10) * (1-0.9) = 0.2
# Loss for y=5, q=5: (5-5) * 0.9 = 0.0
# Average = (1.8 + 0.2 + 0.0) / 3 = 0.667

loss = kdu.compute_pinball_loss(y_true, y_pred_q90, quantile)
print(f"Average Pinball Loss for Q90: {loss:.3f}")

Expected Output:
Average Pinball Loss for Q90: 0.667

Computing the CRPS (compute_crps())

Purpose: This utility approximates the Continuous Ranked Probability Score (CRPS), a proper scoring rule that provides a single, comprehensive measure of a probabilistic forecast’s quality. It generalizes the Mean Absolute Error to a probabilistic setting and simultaneously assesses both calibration and sharpness. A lower CRPS value indicates a better forecast.

Mathematical Concept: The Continuous Ranked Probability Score (CRPS) is a widely used metric for evaluating probabilistic forecasts [1]. For a single observation \(y\) and a predictive CDF \(F\), it is defined as the integrated squared difference between the forecast CDF and the empirical CDF of the observation:

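\[\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \left(F(x) - \mathbf{1}\{x \ge y\}\right)^2 \, dx \tag{7}\]

where \(\mathbf{1}\{x \ge y\}\) is the indicator function, which acts here as a Heaviside step located at the observation \(y\).

When the forecast is given as a set of \(M\) quantiles, the CRPS is approximated by averaging the Pinball Loss \(\mathcal{L}_{\tau}\) over all provided quantile levels \(\tau\). The final score is the average over all observations and all quantiles. Note that the exact CRPS equals twice the integral of the Pinball Loss over \(\tau \in (0, 1)\), so on a dense, evenly spaced quantile grid this average is close to half the CRPS; the constant factor does not affect model comparisons.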
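The quantile-based approximation amounts to a broadcasted pinball loss followed by a single mean. The sketch below assumes the "average over all observations and all quantiles" definition given above; `crps_from_quantiles` is a hypothetical helper, not the library function.

```python
import numpy as np

def crps_from_quantiles(y_true, y_pred_quantiles, quantiles):
    """Mean pinball loss over all observations and quantile levels."""
    y = np.asarray(y_true, dtype=float)[:, np.newaxis]        # shape (N, 1)
    q = np.asarray(y_pred_quantiles, dtype=float)             # shape (N, M)
    tau = np.asarray(quantiles, dtype=float)[np.newaxis, :]   # shape (1, M)
    diff = y - q
    loss = np.where(diff >= 0, tau * diff, (tau - 1.0) * diff)
    return float(np.mean(loss))

# One observation (y = 4) with forecasts at the 25%, 50%, and 75% levels.
# Losses: 0.25 * 2 = 0.5, then 0.0, then 0.25 * 4 = 1.0; mean = 0.5
score = crps_from_quantiles([4.0], [[2.0, 4.0, 8.0]], [0.25, 0.5, 0.75])
print(score)  # 0.5
```

Broadcasting the (N, 1) observations against the (N, M) forecasts evaluates every quantile of every observation without an explicit loop.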

Interpretation: The CRPS provides a single number to summarize the overall performance of a probabilistic forecast.

  • Lower is Better: A model with a lower average CRPS is considered superior, as it indicates a better combination of calibration and sharpness.

  • Units: The CRPS is expressed in the same units as the observed variable, making it easy to interpret.

Use Cases

  • To get a single, high-level summary score for comparing the overall performance of multiple probabilistic models.

  • To use as the primary objective function when tuning a probabilistic forecasting model.

  • To use alongside diagnostic plots like the PIT Histogram and Sharpness Diagram to understand why one model has a better CRPS than another.

Example: The following example demonstrates how to calculate the average CRPS for a set of quantile forecasts.

import numpy as np
import kdiagram.utils as kdu

# Define true values and quantile forecasts for 2 observations
y_true = np.array([10, 25])
quantiles = np.array([0.1, 0.5, 0.9])
y_preds = np.array([
    [8, 11, 13],  # Forecast for y_true = 10
    [20, 22, 26]   # Forecast for y_true = 25
])

# Pinball losses for y=10: 0.2, 0.5, 0.3; for y=25: 0.5, 1.5, 0.1
# Average over all six values = 3.1 / 6 = 0.517

# Calculate the average CRPS
crps_score = kdu.compute_crps(y_true, y_preds, quantiles)
print(f"Average CRPS: {crps_score:.3f}")

Expected Output:
Average CRPS: 0.517

Computing PIT Values (compute_pit())

Purpose: This utility computes the Probability Integral Transform (PIT) value for each individual observation in a dataset. The PIT is a fundamental score for assessing the calibration of a probabilistic forecast. The output of this function is an array of PIT values, which can then be visualized (e.g., with plot_pit_histogram()) or used to calculate summary statistics of calibration.

Mathematical Concept: The Probability Integral Transform (PIT) is a foundational concept in forecast verification [1]. For a continuous predictive distribution with a Cumulative Distribution Function (CDF) denoted by \(F\), the PIT value for a given observation \(y\) is calculated as \(F(y)\).

When a predictive distribution is represented by a finite set of \(M\) quantiles, as is common in machine learning, the PIT value for each observation \(y_i\) is approximated. It is calculated as the fraction of the forecast quantiles that are less than or equal to the observed value:

\[\text{PIT}_i = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}\{q_{i,j} \le y_i\} \tag{8}\]

where \(q_{i,j}\) is the \(j\)-th quantile forecast for observation \(i\), and \(\mathbf{1}\) is the indicator function. If a forecast is perfectly calibrated, the resulting array of PIT values will be uniformly distributed on the interval \([0, 1]\).
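The counting rule in the formula reduces to a row-wise mean of a boolean comparison. This is a sketch of the formula, not the library's code; `pit_from_quantiles` is a hypothetical helper and the data are made up.

```python
import numpy as np

def pit_from_quantiles(y_true, y_pred_quantiles):
    """Fraction of forecast quantiles at or below each observation."""
    y = np.asarray(y_true, dtype=float)[:, np.newaxis]   # shape (N, 1)
    q = np.asarray(y_pred_quantiles, dtype=float)        # shape (N, M)
    return np.mean(q <= y, axis=1)

# Both observations share the same four forecast quantiles.
# y = 3 exceeds 2 of 4 quantiles; y = 9 exceeds all 4.
pit = pit_from_quantiles(
    [3.0, 9.0],
    [[1.0, 2.0, 4.0, 5.0], [1.0, 2.0, 4.0, 5.0]],
)
print(pit)  # [0.5 1. ]
```

With \(M\) quantiles the result can only take the \(M + 1\) values \(0, 1/M, \ldots, 1\), so the PIT histogram becomes smoother as more quantile levels are supplied.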

Example: The following example demonstrates how to compute the PIT value for each observation in a small dataset.

import numpy as np
import kdiagram.utils as kdu

# Define true values and quantile forecasts for 3 observations
y_true = np.array([10, 1, 5.5])
quantiles = np.array([0.1, 0.5, 0.9])
y_preds = np.array([
    [8, 11, 13],  # Forecast for y_true = 10
    [0, 0.5, 2],  # Forecast for y_true = 1
    [4, 5, 6]     # Forecast for y_true = 5.5
])

# Calculate the PIT value for each observation
# - For y=10, 1/3 quantiles are <= 10 -> PIT = 0.333
# - For y=1, 2/3 quantiles are <= 1 -> PIT = 0.667
# - For y=5.5, 2/3 quantiles are <= 5.5 -> PIT = 0.667
pit_values = kdu.compute_pit(y_true, y_preds, quantiles)
print(pit_values)

Expected Output:
[0.33333333 0.66666667 0.66666667]

Calculating Calibration Error (calculate_calibration_error())

Purpose: This utility quantifies the overall calibration error of a probabilistic forecast with a single numerical score. It works by first computing the Probability Integral Transform (PIT) values and then using the Kolmogorov-Smirnov (KS) statistic to measure how much their distribution deviates from the ideal uniform distribution. A lower score indicates better calibration.

Mathematical Concept: This function provides a summary statistic for the PIT histogram. A perfectly calibrated forecast produces PIT values that are uniformly distributed on \([0, 1]\). The calibration error is quantified by measuring the maximum difference between the empirical Cumulative Distribution Function (CDF) of the PIT values and the CDF of a perfect uniform distribution.

This maximum difference is the Kolmogorov-Smirnov (KS) statistic, \(D_n\).

\[D_n = \sup_{x} | F_{PIT}(x) - U(x) | \tag{9}\]

where:

  • \(F_{PIT}(x)\) is the empirical CDF of the calculated PIT values.

  • \(U(x)\) is the CDF of the standard uniform distribution (i.e., \(U(x) = x\)).

  • \(\sup_{x}\) denotes the supremum of these distances over all \(x \in [0, 1]\).

The score is between 0 and 1, where 0 represents perfect calibration.
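The KS distance to the uniform distribution can be computed from the sorted PIT values alone. The sketch below uses the standard one-sample formula with NumPy only; `ks_uniform` is a hypothetical helper and may differ in detail from what the library calls internally.

```python
import numpy as np

def ks_uniform(pit_values):
    """One-sample KS statistic of the values against Uniform(0, 1)."""
    x = np.sort(np.asarray(pit_values, dtype=float))
    n = x.size
    ecdf_after = np.arange(1, n + 1) / n   # ECDF just after each sorted point
    ecdf_before = np.arange(0, n) / n      # ECDF just before each sorted point
    d_plus = np.max(ecdf_after - x)        # ECDF above the uniform CDF
    d_minus = np.max(x - ecdf_before)      # ECDF below the uniform CDF
    return float(max(d_plus, d_minus))

spread = ks_uniform([0.125, 0.375, 0.625, 0.875])   # evenly spread PIT values
clustered = ks_uniform([0.5, 0.5, 0.5, 0.5])        # all mass at the center
print(spread, clustered)  # 0.125 0.5
```

Because the empirical CDF is a step function, the supremum is always attained at one of the sample points, which is why checking both sides of each step is sufficient.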

Example: The following example demonstrates how to calculate the calibration error for a well-calibrated model and a poorly calibrated (overconfident) model.

import numpy as np
from scipy.stats import norm
import kdiagram.utils as kdu

# Generate synthetic data
np.random.seed(42)
n_samples = 500
y_true = np.random.normal(loc=10, scale=5, size=n_samples)
quantiles = np.linspace(0.05, 0.95, 19)

# A well-calibrated forecast
good_preds = norm.ppf(
    quantiles, loc=10, scale=5
).reshape(1, -1).repeat(n_samples, axis=0)

# A poorly calibrated (overconfident) forecast
bad_preds = norm.ppf(
    quantiles, loc=10, scale=2.5
).reshape(1, -1).repeat(n_samples, axis=0)

# Calculate the calibration error for both models
calib_error_good = kdu.calculate_calibration_error(
    y_true, good_preds, quantiles
)
calib_error_bad = kdu.calculate_calibration_error(
    y_true, bad_preds, quantiles
)

print(f"Calibration Error (Good Model): {calib_error_good:.3f}")
print(f"Calibration Error (Bad Model): {calib_error_bad:.3f}")

Expected Output:
Calibration Error (Good Model): 0.035
Calibration Error (Bad Model): 0.298

Building a CDF Interpolator (build_cdf_interpolator())

Purpose: This is an advanced utility that constructs a callable empirical Cumulative Distribution Function (CDF) from a set of quantile forecasts. It returns a new function that can be used to find the estimated cumulative probability for any given value. This is a foundational tool for advanced probabilistic analysis, such as calculating PIT values or the probability of exceeding a critical threshold.

Mathematical Concept: The Probability Integral Transform (PIT) is a key concept in probabilistic forecast evaluation [1]. For a continuous predictive CDF \(F\), the PIT of an observation \(y\) is \(F(y)\). This utility constructs an empirical approximation of \(F\) for each forecast.

The function works by creating a closure: the returned _interpolator function “remembers” the quantile forecasts it was built with. For each observation \(y_i\), it performs a linear interpolation using the corresponding forecast quantiles \(\mathbf{q}_i = (q_{i,1}, ..., q_{i,M})\) as the x-coordinates and the quantile levels \(\mathbf{\tau} = (\tau_1, ..., \tau_M)\) as the y-coordinates. This allows you to estimate the cumulative probability for any value of \(y_i\).
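The closure pattern described above can be sketched with `np.interp`. This is an illustration, not the library's implementation: `make_cdf_interpolator` is a hypothetical name, and the behavior outside the forecast range (clamping to the first and last quantile levels, which is the `np.interp` default) is an assumption of this sketch.

```python
import numpy as np

def make_cdf_interpolator(pred_quantiles, quantiles):
    """Return a function mapping observations to interpolated CDF values."""
    q = np.asarray(pred_quantiles, dtype=float)   # shape (N, M), rows ascending
    tau = np.asarray(quantiles, dtype=float)      # shape (M,)

    def _interpolator(y):
        y = np.asarray(y, dtype=float)
        # One interpolation per observation: forecast values are the
        # x-coordinates, quantile levels are the y-coordinates.
        return np.array([np.interp(y_i, q_i, tau) for y_i, q_i in zip(y, q)])

    return _interpolator

cdf = make_cdf_interpolator([[0.0, 10.0, 20.0]], [0.1, 0.5, 0.9])
inside = cdf([5.0])     # halfway between the 10% and 50% quantiles
outside = cdf([25.0])   # beyond the last quantile, clamped to its level
print(np.round(inside, 6), np.round(outside, 6))  # [0.3] [0.9]
```

The returned function keeps `q` and `tau` alive through the closure, so it can be called repeatedly without passing the forecasts again.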

Example: The following example demonstrates how to build the interpolator from a set of forecasts and then use the resulting function to calculate the PIT values for several new observations.

import numpy as np
import kdiagram.utils as kdu

# Forecasts for 3 observations at 3 quantiles (0.1, 0.5, 0.9)
preds_quantiles = np.array([
    [8, 10, 12],
    [0, 1, 2],
    [4, 5, 6]
])
quantiles = np.array([0.1, 0.5, 0.9])

# Build the interpolator from the forecast distributions
cdf_func = kdu.build_cdf_interpolator(preds_quantiles, quantiles)

# Now, use the new function to find the PIT for 3 observations
y_true = np.array([10.0, 0.5, 5.5])
pit_values = cdf_func(y_true)
print(pit_values)

Expected Output:
[0.5 0.3 0.7]

Min-Max Scaling (minmax_scaler())

Purpose: This utility scales features to a specified range, most commonly [0, 1]. Min-Max scaling is a standard preprocessing step for many machine learning algorithms that are sensitive to the magnitude of input features, such as neural networks and distance-based algorithms. This implementation is robust to features with zero variance: it adds a small epsilon to the denominator to prevent division-by-zero errors.

Mathematical Concept: The Min-Max scaling transformation is a linear operation. For each feature (column) in the input data \(\mathbf{X}\), the transformation is calculated as described in the scikit-learn documentation [2]:

\[X_{\text{scaled}} = \text{min}_{\text{range}} + (\text{max}_{\text{range}} - \text{min}_{\text{range}}) \cdot \frac{\mathbf{X} - \min(\mathbf{X})} {(\max(\mathbf{X}) - \min(\mathbf{X})) + \varepsilon} \tag{10}\]

where:

  • \(\text{min}_{\text{range}}\) and \(\text{max}_{\text{range}}\) are the bounds of the feature_range.

  • \(\min(\mathbf{X})\) and \(\max(\mathbf{X})\) are the minimum and maximum values of the feature.

  • \(\varepsilon\) is a small epsilon to ensure numerical stability.
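The effect of the epsilon term shows up most clearly on a constant column. The sketch below implements the formula with NumPy; `minmax_scale` is a hypothetical helper, and the value `eps=1e-8` is an assumption rather than the library's documented default.

```python
import numpy as np

def minmax_scale(X, feature_range=(0.0, 1.0), eps=1e-8):
    """Column-wise min-max scaling with an epsilon-guarded denominator."""
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return lo + (hi - lo) * (X - x_min) / ((x_max - x_min) + eps)

# The second column is constant, so max - min is exactly zero.
X = np.array([[1.0, 7.0], [3.0, 7.0]])
scaled = minmax_scale(X)
print(np.isfinite(scaled).all())  # True
```

Without the epsilon the constant column would produce 0/0 and fill the output with NaN; with it, the column maps to the lower bound of the range, and the other columns are shifted from their exact values by a negligible amount.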

Example: The following example demonstrates how to scale a 2D array to the default [0, 1] range and to a custom [-1, 1] range.

import numpy as np
import kdiagram.utils as kdu

# Create a sample 2D array
X = np.array([[1, 10], [2, 20], [3, 30]])

# Scale to the default [0, 1] range
X_scaled_default = kdu.minmax_scaler(X)
print("--- Scaled to [0, 1] ---")
print(X_scaled_default)

# Scale to a custom [-1, 1] range
X_scaled_custom = kdu.minmax_scaler(X, feature_range=(-1, 1))
print("\n--- Scaled to [-1, 1] ---")
print(X_scaled_custom)

Expected Output:
--- Scaled to [0, 1] ---
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

--- Scaled to [-1, 1] ---
[[-1. -1.]
 [ 0.  0.]
 [ 1.  1.]]