kdiagram.datasets.make_uncertainty_data

kdiagram.datasets.make_uncertainty_data(n_samples=150, n_periods=4, anomaly_frac=0.15, start_year=2022, prefix='value', base_value=10.0, trend_strength=1.5, noise_level=2.0, interval_width_base=4.0, interval_width_noise=1.5, interval_width_trend=0.5, seed=42, as_frame=False)

Generate a synthetic multi-period uncertainty dataset.

Creates a compact dataset for testing k-diagram uncertainty visualizations: simulated actuals (for the first period), quantile predictions Q10/Q50/Q90 over multiple periods, controllable trends and noise, injected interval-coverage failures (anomalies), and simple spatial features. This is useful for coverage, calibration, drift, and consistency diagnostics [1][2][3].

Parameters:
n_samples : int, default=150

Number of rows (locations) to generate.

n_periods : int, default=4

Number of consecutive periods (e.g., years) for which to generate quantiles.

anomaly_frac : float, default=0.15

Fraction in [0, 1] of rows whose first-period actual is forced outside the Q10–Q90 interval (half under-, half over-prediction, up to rounding).

start_year : int, default=2022

First period’s year used in column names.

prefix : str, default='value'

Base prefix for generated value/quantile columns.

base_value : float, default=10.0

Mean level for the latent signal that drives Q50.

trend_strength : float, default=1.5

Linear trend added to Q50 by period index (lead time).

noise_level : float, default=2.0

Standard deviation for Gaussian noise added to the latent signal (for Q50 and actuals).

interval_width_base : float, default=4.0

Baseline width of the Q10–Q90 interval in the first period.

interval_width_noise : float, default=1.5

Uniform jitter magnitude applied per row/period to the interval width.

interval_width_trend : float, default=0.5

Linear trend added to interval width across periods.

seed : int or None, default=42

NumPy RNG seed for reproducibility. If None, a fresh RNG is used.

as_frame : bool, default=False

If False, return a Bunch with arrays and metadata. If True, return only the pandas DataFrame.

Returns:
data : Bunch or pandas.DataFrame

If as_frame=False (default), a Bunch with:

  • frame : pandas DataFrame with spatial features, first-period actual, and Q10/Q50/Q90 columns by period.

  • feature_names : ['location_id', 'longitude', 'latitude', 'elevation'].

  • target_names : [f'{prefix}_actual'].

  • target : ndarray of actual values.

  • quantile_cols : dict mapping 'q0.1', 'q0.5', 'q0.9' to lists of column names across periods.

  • q10_cols, q50_cols, q90_cols : convenience lists.

  • n_periods : number of generated periods.

  • prefix : the column name prefix.

  • DESCR : human-readable description.

If as_frame=True, only the pandas DataFrame is returned.

Raises:
TypeError

If numeric inputs cannot be processed.

Return type:

Bunch | DataFrame

See also

kdiagram.plot.uncertainty.plot_coverage

Aggregate empirical coverage vs. nominal levels.

kdiagram.plot.uncertainty.plot_coverage_diagnostic

Point-wise success/failure on a polar layout.

kdiagram.plot.uncertainty.plot_interval_consistency

Temporal stability of interval widths per location.

kdiagram.plot.uncertainty.plot_model_drift

Lead-time trend of mean interval width.

kdiagram.plot.uncertainty.plot_anomaly_magnitude

Where and how severely intervals fail.

Notes

Column naming. Quantile columns encode the year \(y\) and quantile level \(q\):

\[\text{quantile name} \;\equiv\; \texttt{<prefix>}\_{y}\_\texttt{q}q, \qquad y \in \{\texttt{start\_year},\dots\}, \;\; q \in \{0.1,0.5,0.9\}. \tag{1}\]

The first-period actual is stored once as f"{prefix}_actual".
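The naming rule above can be sketched in plain Python. The helper `quantile_column_names` is hypothetical and not part of kdiagram. It assumes one calendar year per period, and it infers the text form of the quantile level from the documented name `value_2022_q0.1`.

```python
def quantile_column_names(prefix="value", start_year=2022, n_periods=4,
                          quantiles=(0.1, 0.5, 0.9)):
    """Hypothetical helper: build names of the form <prefix>_<year>_q<q>."""
    # Assumption: periods are consecutive years starting at start_year.
    years = range(start_year, start_year + n_periods)
    # One list of per-period column names for each quantile level.
    return {
        f"q{q}": [f"{prefix}_{year}_q{q}" for year in years]
        for q in quantiles
    }


cols = quantile_column_names(n_periods=2)
print(cols["q0.1"])  # ['value_2022_q0.1', 'value_2023_q0.1']
print(sorted(cols))  # ['q0.1', 'q0.5', 'q0.9']
```

The returned mapping has the same shape as the documented `quantile_cols` entry of the Bunch: quantile keys mapped to lists of column names across periods.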

Signal and interval model. Let \(t \in \{0,\dots,\texttt{n\_periods}-1\}\) be the period index and \(i\) the row index. Define the latent base signal \(s_i\) and Q50:

\[s_i \;=\; \texttt{base\_value} \;+\; \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),\; \sigma=\texttt{noise\_level}/2, \tag{2}\]

\[Q50_{i,t} \;=\; s_i \;+\; t\cdot\texttt{trend\_strength} \;+\; \eta_{i,t}, \quad \eta_{i,t} \sim \mathcal{N}\!\big(0, (\texttt{noise\_level}/3)^2\big). \tag{3}\]

Interval width \(w_{i,t}\) has baseline, trend, and jitter:

\[w_{i,t} \;=\; \max\!\Bigl( 0.1,\, \texttt{interval\_width\_base} + t\cdot\texttt{interval\_width\_trend} + u_{i,t} \Bigr), \quad u_{i,t} \sim \mathcal{U}\!\Bigl(-\tfrac{\texttt{interval\_width\_noise}}{2},\, \tfrac{\texttt{interval\_width\_noise}}{2}\Bigr), \tag{4}\]

and

\[Q10_{i,t} \;=\; Q50_{i,t} - \tfrac{1}{2}w_{i,t},\qquad Q90_{i,t} \;=\; Q50_{i,t} + \tfrac{1}{2}w_{i,t}. \tag{5}\]
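The signal and interval equations above translate almost line by line into NumPy. The following is a minimal sketch, not the library's code: `simulate_quantiles` is a hypothetical name, and the use of `numpy.random.default_rng` is an assumption, so its numbers will not necessarily match `make_uncertainty_data` for the same seed.

```python
import numpy as np


def simulate_quantiles(n_samples=150, n_periods=4, base_value=10.0,
                       trend_strength=1.5, noise_level=2.0,
                       interval_width_base=4.0, interval_width_noise=1.5,
                       interval_width_trend=0.5, seed=42):
    """Illustrative sketch of the Q10/Q50/Q90 model (not the library code)."""
    rng = np.random.default_rng(seed)  # assumption: Generator-based RNG
    t = np.arange(n_periods)           # period index, shape (n_periods,)

    # Latent signal per row, with sigma = noise_level / 2.
    s = base_value + rng.normal(0.0, noise_level / 2, size=n_samples)

    # Median: signal + linear trend + per-cell noise, sigma = noise_level / 3.
    eta = rng.normal(0.0, noise_level / 3, size=(n_samples, n_periods))
    q50 = s[:, None] + t * trend_strength + eta

    # Width: baseline + linear trend + uniform jitter, floored at 0.1.
    half = interval_width_noise / 2
    u = rng.uniform(-half, half, size=(n_samples, n_periods))
    w = np.maximum(0.1, interval_width_base + t * interval_width_trend + u)

    # Symmetric interval around the median.
    return q50 - w / 2, q50, q50 + w / 2


q10, q50, q90 = simulate_quantiles(n_samples=5, n_periods=3, seed=0)
print((q90 - q10).mean(axis=0))  # mean interval width per period
```

With the default parameters the width jitter is at most 0.75 in either direction, so the mean width per period should stay close to 4.0, 4.5, 5.0, and so on. That lead-time growth is the kind of pattern a drift plot is meant to expose.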

Anomaly injection (first period). For a fraction anomaly_frac of rows we enforce a coverage failure:

\[y^{\text{actual}}_{i} \notin [\,Q10_{i,0},\,Q90_{i,0}\,], \tag{6}\]

splitting under/over cases approximately evenly to aid tests of coverage diagnostics and anomaly magnitude plots. Use this data to study calibration vs. sharpness trade-offs [2] and operational verification practice [1].
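One way to picture the injection step is the sketch below. It is an assumption-laden illustration rather than the library's method: `inject_anomalies` and its `margin` parameter (how far outside the bound a forced failure lands) are hypothetical, and the source does not say how rows are selected or how far they are displaced.

```python
import numpy as np


def inject_anomalies(actual, q10, q90, anomaly_frac=0.15, margin=0.5, seed=42):
    """Hypothetical sketch: force a fraction of actuals outside [q10, q90]."""
    rng = np.random.default_rng(seed)
    actual = np.array(actual, dtype=float)  # work on a copy
    n = actual.shape[0]
    n_anom = int(round(anomaly_frac * n))

    # Pick distinct rows, then split them roughly in half.
    idx = rng.choice(n, size=n_anom, replace=False)
    below, above = idx[: n_anom // 2], idx[n_anom // 2:]

    actual[below] = q10[below] - margin  # actual falls below Q10
    actual[above] = q90[above] + margin  # actual rises above Q90
    return actual, idx


# Toy first-period interval: every row starts perfectly covered.
q50 = np.full(20, 10.0)
q10, q90 = q50 - 2.0, q50 + 2.0
y, idx = inject_anomalies(q50, q10, q90, anomaly_frac=0.15)

covered = (y >= q10) & (y <= q90)
print(len(idx), covered.sum())  # 3 failures injected, 17 rows still covered
```

In this toy setup the empirical Q10–Q90 coverage drops from 100% to 85%, which matches `1 - anomaly_frac`. In the generated dataset, rows that are not selected can still fall outside the interval by chance, so observed coverage may be lower than that.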

References

Examples

>>> # Return a Bunch and inspect quantile columns:
>>>
>>> from kdiagram.datasets import make_uncertainty_data
>>> ds = make_uncertainty_data(n_samples=12, n_periods=3, seed=7)
>>> sorted(ds.quantile_cols.keys())
['q0.1', 'q0.5', 'q0.9']
>>>
>>> # Return only a DataFrame and check column order:
>>>
>>> df = make_uncertainty_data(as_frame=True, n_samples=5, seed=0)
>>> df.columns[:6].tolist()  # features + actual then Q10/Q50/Q90
['location_id', 'longitude', 'latitude', 'elevation',
 'value_actual', 'value_2022_q0.1']