kdiagram.datasets.make_uncertainty_data

kdiagram.datasets.make_uncertainty_data(n_samples=150, n_periods=4, anomaly_frac=0.15, start_year=2022, prefix='value', base_value=10.0, trend_strength=1.5, noise_level=2.0, interval_width_base=4.0, interval_width_noise=1.5, interval_width_trend=0.5, seed=42, as_frame=False)

Generate a synthetic multi-period uncertainty dataset.

Creates a compact dataset for testing k-diagram uncertainty visualizations: simulated actuals (for the first period), quantile predictions Q10/Q50/Q90 over multiple periods, controllable trends and noise, injected interval-coverage failures (anomalies), and simple spatial features. This is useful for coverage, calibration, drift, and consistency diagnostics [1][2][3].

Parameters:

n_samples (int, default=150): Number of rows (locations) to generate.
n_periods (int, default=4): Number of consecutive periods (e.g., years) for which to generate quantiles.
anomaly_frac (float, default=0.15): Fraction in [0, 1] of rows whose first-period actual is forced outside the Q10–Q90 interval (half under-, half over-prediction, up to rounding).
start_year (int, default=2022): Year of the first period, used in column names.
prefix (str, default='value'): Base prefix for generated value/quantile columns.
base_value (float, default=10.0): Mean level of the latent signal that drives Q50.
trend_strength (float, default=1.5): Linear trend added to Q50 per period index (lead time).
noise_level (float, default=2.0): Scale of the Gaussian noise added to the latent signal (for Q50 and actuals); see Notes for how it maps to standard deviations.
interval_width_base (float, default=4.0): Baseline width of the Q10–Q90 interval in the first period.
interval_width_noise (float, default=1.5): Magnitude of the uniform jitter applied per row and period to the interval width.
interval_width_trend (float, default=0.5): Linear trend added to the interval width across periods.
seed (int or None, default=42): NumPy RNG seed for reproducibility. If None, a fresh RNG is used.
as_frame (bool, default=False): If False, return a Bunch with arrays and metadata. If True, return only the pandas DataFrame.

Returns:

data (Bunch or pandas.DataFrame)

If as_frame=False (default), a Bunch with:

frame : pandas DataFrame with spatial features, first-period actual, and Q10/Q50/Q90 columns by period.
feature_names : ['location_id', 'longitude', 'latitude', 'elevation'].
target_names : [f'{prefix}_actual'].
target : ndarray of actual values.
quantile_cols : dict mapping 'q0.1', 'q0.5', 'q0.9' to lists of column names across periods.
q10_cols, q50_cols, q90_cols : convenience lists.
n_periods : number of generated periods.
prefix : the column name prefix.
DESCR : human-readable description.

If as_frame=True, only the pandas DataFrame is returned.

Raises:

TypeError: If numeric inputs cannot be processed.

Return type:

Bunch | DataFrame

See also

kdiagram.plot.uncertainty.plot_coverage: Aggregate empirical coverage vs. nominal levels.
kdiagram.plot.uncertainty.plot_coverage_diagnostic: Point-wise success/failure on a polar layout.
kdiagram.plot.uncertainty.plot_interval_consistency: Temporal stability of interval widths per location.
kdiagram.plot.uncertainty.plot_model_drift: Lead-time trend of mean interval width.
kdiagram.plot.uncertainty.plot_anomaly_magnitude: Where and how severely intervals fail.

Notes

Column naming. Quantile columns encode the year \(y\) and quantile level \(q\):

\[\text{quantile name} \;\equiv\; \texttt{<prefix>}\_{y}\_\texttt{q}q, \qquad y \in \{\texttt{start\_year},\dots\}, \;\; q \in \{0.1,0.5,0.9\}. \tag{1}\]

The first-period actual is stored once as f"{prefix}_actual".
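The naming scheme can be sketched in a few lines. The helper below is hypothetical (it is not part of kdiagram) and assumes that periods advance by one year from start_year; it only rebuilds the names you would expect to find in quantile_cols.

```python
def quantile_column_names(prefix="value", start_year=2022, n_periods=4,
                          levels=(0.1, 0.5, 0.9)):
    """Hypothetical helper: rebuild the expected quantile column names.

    Assumes one column per (year, level) pair and consecutive years
    starting at ``start_year``.
    """
    cols = {}
    for q in levels:
        # Key follows the documented 'q0.1' / 'q0.5' / 'q0.9' pattern.
        cols[f"q{q}"] = [
            f"{prefix}_{start_year + t}_q{q}" for t in range(n_periods)
        ]
    return cols


names = quantile_column_names(n_periods=2)
actual_col = "value_actual"  # f"{prefix}_actual" with the default prefix
print(names["q0.1"])
```

Comparing such a dictionary against ds.quantile_cols is a quick sanity check before passing column lists to the plotting functions.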

Signal and interval model. Let the period index be \(t \in \{0,\dots,\texttt{n\_periods}-1\}\) and the row index be \(i\). Define the latent base signal \(s_i\) and Q50:

\[s_i \;=\; \texttt{base\_value} \;+\; \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),\; \sigma=\texttt{noise\_level}/2, \tag{2}\]

\[Q50_{i,t} \;=\; s_i \;+\; t\cdot\texttt{trend\_strength} \;+\; \eta_{i,t}, \quad \eta_{i,t} \sim \mathcal{N}\!\big(0, (\texttt{noise\_level}/3)^2\big). \tag{3}\]

Interval width \(w_{i,t}\) has baseline, trend, and jitter:

\[w_{i,t} \;=\; \max\!\Bigl( 0.1,\, \texttt{interval\_width\_base} + t\cdot\texttt{interval\_width\_trend} + u_{i,t} \Bigr), \quad u_{i,t} \sim \mathcal{U}\!\Bigl(-\tfrac{\texttt{interval\_width\_noise}}{2},\, \tfrac{\texttt{interval\_width\_noise}}{2}\Bigr), \tag{4}\]

and

\[Q10_{i,t} \;=\; Q50_{i,t} - \tfrac{1}{2}w_{i,t},\qquad Q90_{i,t} \;=\; Q50_{i,t} + \tfrac{1}{2}w_{i,t}. \tag{5}\]
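To make the model concrete, here is a minimal standard-library sketch of the signal and interval equations above. It is not the library's implementation: the function name simulate_quantiles is hypothetical, it uses Python's random module rather than NumPy, and its draws will not reproduce the values returned by make_uncertainty_data for the same seed.

```python
import random


def simulate_quantiles(n_samples=5, n_periods=3, base_value=10.0,
                       trend_strength=1.5, noise_level=2.0,
                       interval_width_base=4.0, interval_width_noise=1.5,
                       interval_width_trend=0.5, seed=42):
    """Illustrative re-statement of the signal and interval model.

    Returns a list of rows; each row is a list of (q10, q50, q90)
    tuples, one per period.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Latent base signal: base_value + N(0, (noise_level / 2)^2)
        s = base_value + rng.gauss(0.0, noise_level / 2)
        periods = []
        for t in range(n_periods):
            # Q50: latent signal + linear trend + per-period noise
            q50 = s + t * trend_strength + rng.gauss(0.0, noise_level / 3)
            # Interval width: baseline + trend + uniform jitter, floored at 0.1
            half_jitter = interval_width_noise / 2
            u = rng.uniform(-half_jitter, half_jitter)
            w = max(0.1, interval_width_base + t * interval_width_trend + u)
            # Q10 / Q90 sit symmetrically around Q50
            periods.append((q50 - w / 2, q50, q50 + w / 2))
        rows.append(periods)
    return rows


sim = simulate_quantiles()
widths_first_row = [q90 - q10 for q10, _, q90 in sim[0]]
print(widths_first_row)
```

Because Q10 and Q90 are placed symmetrically around Q50, the interval is always centred on the median and its width never drops below the 0.1 floor.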

Anomaly injection (first period). For a fraction anomaly_frac of rows we enforce a coverage failure:

\[y^{\text{actual}}_{i} \notin [\,Q10_{i,0},\,Q90_{i,0}\,], \tag{6}\]

splitting under/over cases approximately evenly to aid tests of coverage diagnostics and anomaly magnitude plots. Use this data to study calibration vs. sharpness trade-offs [2] and operational verification practice [1].
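The effect of anomaly injection on empirical coverage can be illustrated with a small sketch. Both helpers below are hypothetical and make assumptions the documentation does not state: the affected rows are simply the first ones (the library may choose them differently), and each anomalous actual is pushed a fixed margin beyond the violated bound.

```python
def inject_anomalies(actuals, q10, q90, anomaly_frac=0.15, margin=1.0):
    """Force a fraction of actuals outside [q10, q90] (illustrative only).

    Assumption: the first round(anomaly_frac * n) rows are modified,
    roughly half below Q10 and the rest above Q90.
    """
    n = len(actuals)
    n_anom = round(anomaly_frac * n)
    n_under = n_anom // 2
    out = list(actuals)
    for i in range(n_anom):
        if i < n_under:
            out[i] = q10[i] - margin  # actual falls below the lower bound
        else:
            out[i] = q90[i] + margin  # actual falls above the upper bound
    return out


def empirical_coverage(actuals, q10, q90):
    """Share of actuals that fall inside their [q10, q90] interval."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(actuals, q10, q90))
    return inside / len(actuals)


# Toy first-period data: every actual starts inside its interval.
q10 = [8.0] * 20
q90 = [12.0] * 20
actuals = [10.0] * 20

with_anomalies = inject_anomalies(actuals, q10, q90, anomaly_frac=0.15)
coverage = empirical_coverage(with_anomalies, q10, q90)
print(coverage)
```

In this toy setup, where every untouched actual lies inside its interval, the empirical coverage equals one minus the injected fraction. In the generated dataset, non-anomalous rows may also fall outside the interval by chance, so coverage can be lower.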

References

Examples

>>> # Return a Bunch and inspect quantile columns:
>>>
>>> from kdiagram.datasets import make_uncertainty_data
>>> ds = make_uncertainty_data(n_samples=12, n_periods=3, seed=7)
>>> sorted(ds.quantile_cols.keys())
['q0.1', 'q0.5', 'q0.9']
>>>
>>> # Return only a DataFrame and check column order:
>>>
>>> df = make_uncertainty_data(as_frame=True, n_samples=5, seed=0)
>>> df.columns[:6].tolist()  # features, then the actual, then Q10/Q50/Q90
['location_id', 'longitude', 'latitude', 'elevation',
 'value_actual', 'value_2022_q0.1']