kdiagram.datasets.make_uncertainty_data
- kdiagram.datasets.make_uncertainty_data(n_samples=150, n_periods=4, anomaly_frac=0.15, start_year=2022, prefix='value', base_value=10.0, trend_strength=1.5, noise_level=2.0, interval_width_base=4.0, interval_width_noise=1.5, interval_width_trend=0.5, seed=42, as_frame=False)
Generate a synthetic multi-period uncertainty dataset.
Creates a compact dataset for testing k-diagram uncertainty visualizations: simulated actuals (for the first period), quantile predictions Q10/Q50/Q90 over multiple periods, controllable trends and noise, injected interval-coverage failures (anomalies), and simple spatial features. This is useful for coverage, calibration, drift, and consistency diagnostics [1][2][3].
- Parameters:
- n_samples
int, default=150 Number of rows (locations) to generate.
- n_periods
int, default=4 Number of consecutive periods (e.g., years) for which to generate quantiles.
- anomaly_frac
float, default=0.15 Fraction in [0, 1] of rows whose first-period actual is forced outside the Q10–Q90 interval (half under-, half over-prediction, up to rounding).
- start_year
int, default=2022 First period’s year used in column names.
- prefix
str, default='value' Base prefix for generated value/quantile columns.
- base_value
float, default=10.0 Mean level for the latent signal that drives Q50.
- trend_strength
float, default=1.5 Linear trend added to Q50 by period index (lead time).
- noise_level
float, default=2.0 Standard deviation for Gaussian noise added to the latent signal (for Q50 and actuals).
- interval_width_base
float, default=4.0 Baseline width of the Q10–Q90 interval in the first period.
- interval_width_noise
float, default=1.5 Uniform jitter magnitude applied per row/period to the interval width.
- interval_width_trend
float, default=0.5 Linear trend added to interval width across periods.
- seed
int or None, default=42 NumPy RNG seed for reproducibility. If None, a fresh RNG is used.
- as_frame
bool, default=False If False, return a Bunch with arrays and metadata. If True, return only the pandas DataFrame.
- Returns:
- data
Bunch or pandas.DataFrame If as_frame=False (default), a Bunch with:
  - frame: pandas DataFrame with spatial features, first-period actual, and Q10/Q50/Q90 columns by period.
  - feature_names: ['location_id', 'longitude', 'latitude', 'elevation'].
  - target_names: [f'{prefix}_actual'].
  - target: ndarray of actual values.
  - quantile_cols: dict mapping 'q0.1', 'q0.5', 'q0.9' to lists of column names across periods.
  - q10_cols, q50_cols, q90_cols: convenience lists.
  - n_periods: number of generated periods.
  - prefix: the column name prefix.
  - DESCR: human-readable description.
If as_frame=True, only the pandas DataFrame is returned.
- Raises:
TypeError If numeric inputs cannot be processed.
- Return type:
Bunch | DataFrame
See also
kdiagram.plot.uncertainty.plot_coverage: Aggregate empirical coverage vs. nominal levels.
kdiagram.plot.uncertainty.plot_coverage_diagnostic: Point-wise success/failure on a polar layout.
kdiagram.plot.uncertainty.plot_interval_consistency: Temporal stability of interval widths per location.
kdiagram.plot.uncertainty.plot_model_drift: Lead-time trend of mean interval width.
kdiagram.plot.uncertainty.plot_anomaly_magnitude: Where and how severely intervals fail.
Notes
Column naming. Quantile columns encode the year \(y\) and quantile level \(q\):
\[\text{quantile name} \;\equiv\; \texttt{<prefix>}\_{y}\_\texttt{q}q, \qquad y \in \{\texttt{start\_year},\dots\}, \;\; q \in \{0.1,0.5,0.9\}. \tag{1}\]
The first-period actual is stored once as f"{prefix}_actual".
Signal and interval model. Let the period index be \(t \in \{0,\dots,n\_\text{periods}-1\}\) and the row index be \(i\). Define the latent base signal \(s_i\) and Q50:
\[s_i \;=\; \texttt{base\_value} \;+\; \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),\; \sigma=\texttt{noise\_level}/2, \tag{2}\]
\[Q50_{i,t} \;=\; s_i \;+\; t\cdot\texttt{trend\_strength} \;+\; \eta_{i,t}, \quad \eta_{i,t} \sim \mathcal{N}\!\big(0, (\texttt{noise\_level}/3)^2\big). \tag{3}\]
Interval width \(w_{i,t}\) has baseline, trend, and jitter:
\[w_{i,t} \;=\; \max\!\Bigl( 0.1,\, \texttt{interval\_width\_base} + t\cdot\texttt{interval\_width\_trend} + u_{i,t} \Bigr), \quad u_{i,t} \sim \mathcal{U}\!\Bigl(-\tfrac{\texttt{interval\_width\_noise}}{2},\, \tfrac{\texttt{interval\_width\_noise}}{2}\Bigr), \tag{4}\]
and
\[Q10_{i,t} \;=\; Q50_{i,t} - \tfrac{1}{2}w_{i,t},\qquad Q90_{i,t} \;=\; Q50_{i,t} + \tfrac{1}{2}w_{i,t}. \tag{5}\]
Anomaly injection (first period). For a fraction anomaly_frac of rows we enforce a coverage failure:
\[y^{\text{actual}}_{i} \notin [\,Q10_{i,0},\,Q90_{i,0}\,], \tag{6}\]
splitting under/over cases approximately evenly to aid tests of coverage diagnostics and anomaly magnitude plots. Use this data to study calibration vs. sharpness trade-offs [2] and operational verification practice [1].
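The generative model in the Notes can be sketched in a few lines of pure Python. This is only an illustration of equations (2) to (6), not the library's implementation: the helper name `sketch_uncertainty_rows` is hypothetical, the standard-library `random` module stands in for the NumPy RNG, and two details the Notes do not specify are filled with labeled assumptions (covered actuals are set to the first-period Q50, and injected anomalies miss the interval by half its width).

```python
import random


def sketch_uncertainty_rows(
    n_samples=150, n_periods=4, anomaly_frac=0.15,
    base_value=10.0, trend_strength=1.5, noise_level=2.0,
    interval_width_base=4.0, interval_width_noise=1.5,
    interval_width_trend=0.5, seed=42,
):
    """Hypothetical helper mirroring equations (2)-(6); not the library code."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Eq. (2): latent signal, sigma = noise_level / 2.
        s = base_value + rng.gauss(0.0, noise_level / 2)
        q10, q50, q90 = [], [], []
        for t in range(n_periods):
            # Eq. (3): linear trend in t plus noise, sigma = noise_level / 3.
            median = s + t * trend_strength + rng.gauss(0.0, noise_level / 3)
            # Eq. (4): baseline + trend + uniform jitter, floored at 0.1.
            jitter = rng.uniform(-interval_width_noise / 2,
                                 interval_width_noise / 2)
            width = max(0.1, interval_width_base
                        + t * interval_width_trend + jitter)
            # Eq. (5): symmetric interval around Q50.
            q10.append(median - width / 2)
            q50.append(median)
            q90.append(median + width / 2)
        # ASSUMPTION: the Notes give no formula for the actual, so the
        # first-period Q50 stands in for a covered observation.
        rows.append({"q10": q10, "q50": q50, "q90": q90, "actual": q50[0]})

    # Eq. (6): force a coverage failure on a fraction of rows.
    n_anom = int(round(anomaly_frac * n_samples))
    picked = rng.sample(range(n_samples), n_anom)
    for k, i in enumerate(picked):
        row = rows[i]
        # ASSUMPTION: miss the interval by half its first-period width.
        miss = 0.5 * (row["q90"][0] - row["q10"][0])
        if k < n_anom // 2:
            row["actual"] = row["q10"][0] - miss  # actual below Q10
        else:
            row["actual"] = row["q90"][0] + miss  # actual above Q90
    return rows


rows = sketch_uncertainty_rows(n_samples=100, anomaly_frac=0.15, seed=0)
covered = sum(r["q10"][0] <= r["actual"] <= r["q90"][0] for r in rows)
print(covered / len(rows))  # 0.85: only the 15 injected rows miss
```

Under these assumptions the first-period empirical coverage is exactly one minus the injected fraction, which makes the sketch a convenient sanity check for coverage diagnostics. With the default parameters the jitter is at most 0.75 in magnitude, so the 0.1 floor in equation (4) never binds and the expected width grows linearly as 4.0 + 0.5 t.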
References
Examples
>>> # Return a Bunch and inspect quantile columns:
>>> from kdiagram.datasets import make_uncertainty_data
>>> ds = make_uncertainty_data(n_samples=12, n_periods=3, seed=7)
>>> sorted(ds.quantile_cols.keys())
['q0.1', 'q0.5', 'q0.9']
>>>
>>> # Return only a DataFrame and check column order:
>>> df = make_uncertainty_data(as_frame=True, n_samples=5, seed=0)
>>> df.columns[:6].tolist()  # features + actual, then Q10/Q50/Q90
['location_id', 'longitude', 'latitude', 'elevation', 'value_actual', 'value_2022_q0.1']
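The column-naming convention of equation (1) can also be reproduced without the library, which is handy when selecting columns from a saved frame. The sketch below is a hypothetical helper (`quantile_column_names` is not part of kdiagram) and assumes that consecutive periods are consecutive years, i.e. `start_year + t`.

```python
def quantile_column_names(prefix="value", start_year=2022, n_periods=4,
                          quantiles=(0.1, 0.5, 0.9)):
    """Hypothetical helper: rebuild the documented quantile column names."""
    cols = {}
    for q in quantiles:
        # ASSUMPTION: period t maps to the year start_year + t.
        cols[f"q{q}"] = [
            f"{prefix}_{start_year + t}_q{q}" for t in range(n_periods)
        ]
    return cols


names = quantile_column_names(n_periods=3)
print(names["q0.1"])
# ['value_2022_q0.1', 'value_2023_q0.1', 'value_2024_q0.1']
```

The dictionary keys follow the same 'q0.1', 'q0.5', 'q0.9' pattern described for quantile_cols in the Returns section, so the result can be compared against that attribute when checking a generated dataset.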