Datasets

The kdiagram.datasets module provides convenient functions to access sample datasets included with the package (like the Zhongshan subsidence data) and to generate various synthetic datasets on the fly.

These datasets are invaluable for:

  • Running examples provided in the documentation and gallery.

  • Testing k-diagram’s plotting functions with predictable data structures.

  • Exploring different scenarios of uncertainty, drift, or model comparison.

Most functions allow you to retrieve data either as a standard pandas.DataFrame or as a Bunch object (using the as_frame parameter). The Bunch object conveniently packages the DataFrame along with metadata like feature/target names, relevant column lists, and a description of the dataset’s origin or generation parameters.
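To make the Bunch idea concrete, here is a minimal sketch of a Bunch-like container. It is a stand-in written for illustration only: the class name `MiniBunch` and the sample rows are hypothetical, and this is not k-diagram's actual implementation. It mirrors only the access pattern described above, where the same object answers both key lookups and attribute lookups.

```python
class MiniBunch(dict):
    """Dictionary whose keys are also readable as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods
        # such as .keys() keep working as usual.
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


# Hypothetical payload: a tiny "frame" (list of row dicts) plus metadata.
rows = [
    {"flow_actual": 16.8, "flow_2022_q0.1": 12.0, "flow_2022_q0.9": 21.0},
    {"flow_actual": 9.5, "flow_2022_q0.1": 7.0, "flow_2022_q0.9": 13.0},
]
bunch = MiniBunch(
    frame=rows,
    q10_cols=["flow_2022_q0.1"],
    q90_cols=["flow_2022_q0.9"],
    DESCR="Toy data for demonstrating Bunch-style access.",
)

print(sorted(bunch.keys()))   # metadata keys travel with the data
print(bunch.q10_cols)         # attribute access
print(bunch["q90_cols"])      # equivalent key access
```

Packaging the column lists next to the data is what lets plotting calls take `data.q10_cols` and `data.q90_cols` directly instead of rebuilding column names by hand.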

Function Summary

Dataset Loading and Generation Functions

| Function | Description |
|---|---|
| load_uncertainty_data() | Generates synthetic multi-period quantile data with trends, noise, and anomalies. Ideal for drift/consistency plots. |
| load_zhongshan_subsidence() | Loads the included Zhongshan subsidence prediction sample dataset. |
| make_taylor_data() | Generates a reference series and multiple prediction series with controlled correlation/standard deviation for Taylor Diagrams. |
| make_multi_model_quantile_data() | Generates quantile predictions from multiple simulated models for a single time period. Useful for model comparison plots. |
| make_regression_data() | Generates a rich synthetic dataset for comparing regression models with configurable error profiles. |
| make_classification_data() | Generates a synthetic dataset for classification tasks, including predicted labels and probabilities. |
| make_cyclical_data() | Generates data with true and predicted series exhibiting cyclical/seasonal patterns. |
| make_fingerprint_data() | Generates a synthetic feature importance matrix for feature fingerprint (radar) plots. |

Quantile Naming & Basic Notation

Synthetic quantile columns follow a consistent pattern:

\[\text{colname} \;=\; \texttt{<prefix>\_<year>\_q}\,\alpha, \qquad \alpha \in \{0.1, 0.5, 0.9\}. \tag{1}\]
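The naming pattern can be sketched as a pair of small helpers that build and parse such column names. The function names `quantile_col` and `parse_quantile_col` are hypothetical and not part of k-diagram; the sketch assumes a four-digit year and a decimal quantile level, as in the example outputs on this page.

```python
import re


def quantile_col(prefix, year, q):
    """Build a column name such as 'flow_2022_q0.1'."""
    return f"{prefix}_{year}_q{q}"


# <prefix>_<4-digit year>_q<quantile>; the prefix itself may contain underscores.
_COL_RE = re.compile(r"^(?P<prefix>.+)_(?P<year>\d{4})_q(?P<q>\d*\.?\d+)$")


def parse_quantile_col(name):
    """Return (prefix, year, quantile), or None if the name does not match."""
    match = _COL_RE.match(name)
    if match is None:
        return None
    return match.group("prefix"), int(match.group("year")), float(match.group("q"))


columns = [quantile_col("flow", year, q)
           for year in (2022, 2023)
           for q in (0.1, 0.5, 0.9)]

# Keep only the 2023 bounds, skipping non-quantile columns safely.
bounds_2023 = [c for c in columns + ["flow_actual"]
               if (parsed := parse_quantile_col(c))
               and parsed[1] == 2023 and parsed[2] in (0.1, 0.9)]
print(bounds_2023)
```

Parsing rather than slicing strings keeps the selection robust when the prefix contains underscores, as in `drift_val_2020_q0.1`.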

For an observation \(i\) at period \(t\), the interval width and pointwise coverage indicator are

\[w_{i,t} \;=\; Q90_{i,t} - Q10_{i,t}, \qquad \mathbb{1}\!\left\{ Q10_{i,t} \le y_i \le Q90_{i,t} \right\}. \tag{2}\]

Aggregate empirical coverage across \(n\) points at period \(t\) is

\[\widehat{c} \;=\; \frac{1}{n} \sum_{i=1}^n \mathbb{1}\!\left\{ Q10_{i,t} \le y_i \le Q90_{i,t} \right\}, \tag{3}\]

which should be compared against the nominal level (0.8 for a Q10 to Q90 interval).
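The width and coverage definitions translate directly into a few lines of code. The sketch below uses plain Python lists and made-up numbers; the helper names are hypothetical and are not k-diagram functions.

```python
def interval_widths(q10, q90):
    """Pointwise interval width w_i = Q90_i - Q10_i."""
    return [hi - lo for lo, hi in zip(q10, q90)]


def empirical_coverage(y, q10, q90):
    """Fraction of observations that fall inside [Q10_i, Q90_i]."""
    hits = [lo <= obs <= hi for obs, lo, hi in zip(y, q10, q90)]
    return sum(hits) / len(hits)


# Made-up observations and bounds for a single period.
y = [10.0, 12.0, 9.0, 15.0, 11.0]
q10 = [8.0, 11.0, 9.5, 12.0, 10.0]
q90 = [12.0, 14.0, 11.0, 14.0, 13.0]

widths = interval_widths(q10, q90)
coverage = empirical_coverage(y, q10, q90)
nominal = 0.9 - 0.1  # a Q10 to Q90 interval targets 80% coverage

print(widths)
print(f"coverage={coverage:.2f} vs nominal={nominal:.2f}")
```

Here three of the five observations fall inside their intervals, so the empirical coverage is below the nominal level and the intervals would be judged too narrow on this sample.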

Tip

Keep seeds fixed (seed=...) to obtain deterministic examples in documentation and tests.
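The effect of a fixed seed can be illustrated with the standard library alone. This is only a sketch of the principle: `noisy_series` is a hypothetical helper, and it uses Python's `random` module rather than whatever random number generator k-diagram uses internally.

```python
import random


def noisy_series(n, seed=None):
    """Draw n standard-normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


a = noisy_series(5, seed=42)
b = noisy_series(5, seed=42)
c = noisy_series(5, seed=43)

print(a == b)  # same seed, same values
print(a == c)  # different seed, different values
```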


Usage Examples

Below are examples demonstrating how to use each function.

Loading Zhongshan Subsidence Data

load_zhongshan_subsidence() loads the packaged sample (coordinates, targets for 2022/2023, quantiles 2022–2026). You can subset by year and quantile at load time.

import warnings

from kdiagram.datasets import load_zhongshan_subsidence

# Suppress potential download warnings if data exists locally
warnings.filterwarnings("ignore", message=".*already exists.*")

# Load as DataFrame, subsetting years and quantiles
try:
    df_zhongshan_subset = load_zhongshan_subsidence(
        as_frame=True,
        years=[2023, 2025],
        quantiles=[0.1, 0.9],
        include_target=False,      # Exclude 'subsidence_YYYY' cols
        download_if_missing=True,  # Allow download if not packaged/cached
    )
    print("Loaded Zhongshan Subset DataFrame:")
    print(df_zhongshan_subset.head(3))
    print("\nColumns:")
    print(df_zhongshan_subset.columns)

except FileNotFoundError as e:
    print(f"Error loading Zhongshan data: {e}")
    print("Ensure the package data was installed correctly or "
          "download is enabled/possible.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Example Output (Structure, assuming the load succeeds)
Loaded Zhongshan Subset DataFrame:
     longitude   latitude  subsidence_2023_q0.1  subsidence_2023_q0.9  subsidence_2025_q0.1  subsidence_2025_q0.9
0   113.237984  22.494591              ...              ...              ...              ...
1   113.220802  22.513592              ...              ...              ...              ...
2   113.225632  22.530231              ...              ...              ...              ...

Columns:
Index(['longitude', 'latitude', 'subsidence_2023_q0.1',
       'subsidence_2023_q0.9', 'subsidence_2025_q0.1',
       'subsidence_2025_q0.9'], dtype='object')

Loading Uncertainty Datasets (Synthetic vs Semi-Realistic)

There are two uncertainty-oriented helpers with different purposes:

1) Fully synthetic generator — make_uncertainty_data()

What it is. Programmatically constructs multi-period quantiles (Q10/Q50/Q90) with controllable median trend and interval-width dynamics, plus an optional fraction of injected coverage failures in the first period for testing diagnostics.

When to use. Benchmarks, tutorials, and unit-style checks where you want repeatable behavior and knobs (trend_strength, interval_width_*, anomaly_frac). Ideal for coverage summaries, pointwise diagnostics, and drift/consistency analyses [1][2].

from kdiagram.datasets import make_uncertainty_data

ds = make_uncertainty_data(
    n_samples=200, n_periods=5,
    trend_strength=1.2, interval_width_trend=0.4,
    anomaly_frac=0.2, seed=7
)
df = ds.frame

2) Packaged semi-realistic sample — load_uncertainty_data()

What it is. Loads a compact, ready-to-use sample that mimics the schema and “feel” of the Zhongshan-style quantile outputs (years as periods, Q10/Q50/Q90 columns, and a single “actual” baseline), but without having to fetch the full Zhongshan dataset. Think of it as a toy clone of the real structure for quick demos.

When to use. You need data that “looks like” the Zhongshan project’s outputs (column naming and period layout) without network access or large files—e.g., to wire up gallery pages or quick API examples.

from kdiagram.datasets import load_uncertainty_data

toy = load_uncertainty_data(as_frame=False)  # Bunch with metadata
toy.frame.head()

# Generate as Bunch (default)
data_bunch = load_uncertainty_data(
    n_samples=10, n_periods=2, seed=1, prefix="flow"
)

print("--- Bunch Object ---")
print(f"Keys: {list(data_bunch.keys())}")
print(f"Description:\n{data_bunch.DESCR[:200]}...")  # Print start of DESCR
print("\nDataFrame Head:")
print(data_bunch.frame.head(3))
print("\nQ10 Columns:")
print(data_bunch.q10_cols)

Example Output (Structure)
--- Bunch Object ---
Keys: ['frame', 'feature_names', 'target_names', 'target', 'quantile_cols', 'q10_cols', 'q50_cols', 'q90_cols', 'n_periods', 'prefix', 'start_year', 'DESCR']
Description:
Synthetic Multi-Period Uncertainty Dataset for k-diagram

**Description:**
Generates synthetic data simulating quantile forecasts (Q10,
Q50, Q90) for 'flow' over 2 periods starting
from 2022 across 10 samples/lo...

DataFrame Head:
   location_id  longitude   latitude   elevation  flow_actual  ...
0            0 -116.8388    35.094262  366.807627    16.816179  ...
1            1 -117.8696    34.045590  247.216119     9.508103  ...
2            2 -119.749534  35.488999  353.628218     5.439137  ...

Q10 Columns:
['flow_2022_q0.1', 'flow_2023_q0.1']

Quantile naming and the empirical coverage definition follow the conventions in [1][2]:

\[w_{i,t} = Q90_{i,t} - Q10_{i,t}, \qquad \widehat{c} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{Q10_{i,t} \le y_i \le Q90_{i,t}\}. \tag{4}\]

Generating Taylor Diagram Data

Taylor diagrams summarize correlation and standard deviation in a single polar plot [3]. Use make_taylor_data() to synthesize a reference and several model series with controllable spread and correlation. A bias is also added to each series, but it does not affect the centered statistics shown on a Taylor diagram [2].

from kdiagram.datasets import make_taylor_data

taylor_data = make_taylor_data(n_models=2, n_samples=50, seed=101)

print("--- Taylor Data Bunch ---")
print(f"Reference shape: {taylor_data.reference.shape}")
print(f"Number of prediction series: {len(taylor_data.predictions)}")
print(f"Prediction shapes: {[p.shape for p in taylor_data.predictions]}")
print("\nCalculated Stats:")
print(taylor_data.stats)
print(f"\nActual Reference Std Dev: {taylor_data.ref_std:.4f}")

Example Output
--- Taylor Data Bunch ---
Reference shape: (50,)
Number of prediction series: 2
Prediction shapes: [(50,), (50,)]

Calculated Stats:
           stddev  corrcoef
Model_A  0.729855  0.835114
Model_B  1.029889  0.508220

Actual Reference Std Dev: 0.9404
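The statement that bias does not matter for a Taylor diagram can be checked by hand. The sketch below computes the two Taylor statistics with population formulas on made-up series (the helper names are hypothetical, not k-diagram functions), and also verifies the law-of-cosines identity that links them to the centered RMS difference.

```python
import math


def mean(x):
    return sum(x) / len(x)


def std(x):
    """Population standard deviation."""
    m = mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (std(x) * std(y))


def centered_rmsd(x, y):
    """RMS difference after removing each series' mean."""
    mx, my = mean(x), mean(y)
    return math.sqrt(sum(((a - mx) - (b - my)) ** 2 for a, b in zip(x, y)) / len(x))


reference = [1.0, 2.0, 4.0, 3.0, 5.0]
model = [1.5, 1.8, 3.6, 3.9, 4.4]
biased = [v + 10.0 for v in model]  # same series with a constant offset

for name, series in [("model", model), ("biased", biased)]:
    print(name, round(std(series), 4), round(corr(reference, series), 4))

# Law of cosines behind the diagram: E'^2 = s_r^2 + s_m^2 - 2 * s_r * s_m * R
s_r, s_m, r = std(reference), std(model), corr(reference, model)
lhs = centered_rmsd(reference, model) ** 2
rhs = s_r ** 2 + s_m ** 2 - 2 * s_r * s_m * r
print(abs(lhs - rhs) < 1e-12)
```

Both series print the same standard deviation and correlation, so they would occupy the same point on the diagram even though one is offset by 10 units.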

Generating Multi-Model Quantile Data

make_multi_model_quantile_data() simulates several models producing quantiles for the same horizon. Each model gets its own median bias and overall interval width, supporting calibration/coverage comparisons across models [1][2].

from kdiagram.datasets import make_multi_model_quantile_data

# Get as DataFrame
df_multi_model = make_multi_model_quantile_data(
    n_samples=5, n_models=2, seed=5, as_frame=True,
    quantiles=[0.1, 0.5, 0.9]
)

print("--- Multi-Model Quantile DataFrame ---")
print(df_multi_model)

Example Output
--- Multi-Model Quantile DataFrame ---
   y_true  feature_1  feature_2  pred_Model_A_q0.1  pred_Model_A_q0.5  pred_Model_A_q0.9  pred_Model_B_q0.1  pred_Model_B_q0.5  pred_Model_B_q0.9
0  50.853502   0.533165   5.108194          43.514661          49.740457          54.158097          36.189075          46.430960          58.077600
1  46.300911   0.639037   1.962088          41.607881          45.545123          51.889254          35.546803          41.932122          51.628643
2  44.874897   0.138801   5.689870          42.241030          44.652911          49.972431          37.209904          42.587300          50.182159
3  52.396877   0.948104   2.990119          45.163347          52.437158          57.719859          45.359873          54.715327          60.382700
4  53.938741   0.776598   5.808982          43.275494          53.397751          61.104506          39.947971          52.309521          63.340564
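As a quick way to read this output, the following sketch compares the two models on coverage and mean interval width. The numbers are the five rows shown above, rounded to two decimals and typed in by hand so the snippet runs without k-diagram; the `summarize` helper is hypothetical.

```python
# Values copied from the example output above, rounded to two decimals.
y_true = [50.85, 46.30, 44.87, 52.40, 53.94]
bounds = {
    "Model_A": ([43.51, 41.61, 42.24, 45.16, 43.28],   # q0.1
                [54.16, 51.89, 49.97, 57.72, 61.10]),  # q0.9
    "Model_B": ([36.19, 35.55, 37.21, 45.36, 39.95],
                [58.08, 51.63, 50.18, 60.38, 63.34]),
}


def summarize(y, q_low, q_up):
    """Return empirical coverage and mean interval width for one model."""
    n = len(y)
    coverage = sum(lo <= obs <= hi for obs, lo, hi in zip(y, q_low, q_up)) / n
    mean_width = sum(hi - lo for lo, hi in zip(q_low, q_up)) / n
    return {"coverage": coverage, "mean_width": mean_width}


summary = {name: summarize(y_true, lo, hi) for name, (lo, hi) in bounds.items()}
for name, stats in summary.items():
    print(name, stats)
```

On these five rows both models cover every observation, but Model_B does so with noticeably wider intervals. With so few samples this only illustrates the comparison; it says nothing reliable about calibration.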

Generating Regression Data

make_regression_data() generates datasets for testing regression model evaluation plots. You can control the ground truth signal and the number of features, and define a detailed error profile for each simulated model.

from kdiagram.datasets import make_regression_data

# Define profiles for two models with different error characteristics
model_profiles = {
    "Good Model": {"bias": 0.5, "noise_std": 4.0},
    "Biased Model": {"bias": -10.0, "noise_std": 2.0},
}

# Generate the data as a DataFrame
df_regression = make_regression_data(
    model_profiles=model_profiles,
    seed=42,
    as_frame=True
)

print("--- Regression Data Frame ---")
print(df_regression.head())

Example Output
--- Regression Data Frame ---
      y_true  feature_1  pred_Good_Model  pred_Biased_Model
0  19.917686   6.302826        22.233548           5.414131
1  10.819543   2.272387        14.317278           1.712187
2  24.806819   7.447622        19.778093          12.725647
3  25.401583   7.269946        22.887473          13.559882
4   6.296408   1.034030        12.616590          -3.138418
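To see how the configured error profiles show up in the data, the sketch below estimates the bias (mean error) and RMSE of each model from the five rows printed above. The values are rounded to two decimals and typed in by hand; the `bias_and_rmse` helper is hypothetical.

```python
import math

# Values copied from the example output above, rounded to two decimals.
y_true = [19.92, 10.82, 24.81, 25.40, 6.30]
preds = {
    "Good Model": [22.23, 14.32, 19.78, 22.89, 12.62],
    "Biased Model": [5.41, 1.71, 12.73, 13.56, -3.14],
}


def bias_and_rmse(y, y_hat):
    """Mean error (prediction minus truth) and root mean squared error."""
    errors = [p - t for p, t in zip(y_hat, y)]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


metrics = {name: bias_and_rmse(y_true, p) for name, p in preds.items()}
for name, (bias, rmse) in metrics.items():
    print(f"{name}: bias={bias:+.2f}, rmse={rmse:.2f}")
```

The estimates are noisy with only five rows, but the sign and rough size of each bias are in line with the profiles passed to the generator (0.5 and -10.0).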

Generating Classification Data

make_classification_data() creates datasets for binary or multiclass classification problems. It generates features, true class labels, and for each simulated model, both predicted class labels and predicted probabilities. This makes it ideal for testing plots like ROC/PR curves and confusion matrices.

from kdiagram.datasets import make_classification_data

# Generate data for a 2-class problem with 2 models
df_classification = make_classification_data(
    n_samples=5,
    n_features=2,
    n_classes=2,
    n_models=2,
    seed=42,
    as_frame=True
)

print("--- Classification Data Frame ---")
print(df_classification)

Example Output
--- Classification Data Frame ---
         x1        x2  y        m1        m2
0  1.777792 -0.680930  1  0.659534  0.816292
1 -0.933969  1.222541  0  0.780446  0.705698
2  2.127241 -0.154529  1  0.659211  0.928274
3  1.467509 -0.428328  1  0.544542  0.749182
4  0.140708 -0.352134  1  0.372744  0.366596

Generating Cyclical Data

make_cyclical_data() produces a “true” sinusoid plus one or more phase-shifted / amplitude-scaled prediction series with noise, useful when angle encodes phase (e.g., seasonal cycle). This is convenient for relationship plots and multi-series polar overlays.

from kdiagram.datasets import make_cyclical_data

# Get as Bunch
cycle_bunch = make_cyclical_data(
    n_samples=12, n_series=1, cycle_period=12, seed=5,
    amplitude_true=5, offset_true=10
)

print("--- Cyclical Data Bunch ---")
print(f"Frame shape: {cycle_bunch.frame.shape}")
print(f"Series names: {cycle_bunch.series_names}")
print(cycle_bunch.frame[['time_step', 'y_true', 'model_A']].head())

Example Output
--- Cyclical Data Bunch ---
Frame shape: (12, 3)
Series names: ['model_A']
   time_step     y_true    model_A
0          0   9.830655   9.801473
1          1  14.369168  14.775036
2          2  14.989960  15.554347
3          3   9.668771  10.262745
4          4   4.783064   5.812793
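For intuition about phase shift and amplitude scaling, here is a generic noise-free sinusoid sketch. It is not the generator's actual formula and does not reproduce the output above; the `sinusoid` helper, its parameters, and the sign convention (a positive `phase_shift` delays the series) are assumptions made for this illustration.

```python
import math


def sinusoid(n_samples, cycle_period, amplitude=1.0, offset=0.0, phase_shift=0.0):
    """offset + amplitude * sin(2*pi*t/period - phase_shift) for t = 0..n-1."""
    return [
        offset + amplitude * math.sin(2 * math.pi * t / cycle_period - phase_shift)
        for t in range(n_samples)
    ]


# "True" monthly cycle and a damped prediction that lags by one step
# (one step out of twelve corresponds to a phase of 2*pi/12 = pi/6).
y_true = sinusoid(12, 12, amplitude=5, offset=10)
lagged = sinusoid(12, 12, amplitude=4, offset=10, phase_shift=math.pi / 6)

peak_true = y_true.index(max(y_true))
peak_lagged = lagged.index(max(lagged))
print(peak_true, peak_lagged)                   # peak positions (time steps)
print(round(max(y_true), 2), round(max(lagged), 2))  # peak heights
```

The lagged series peaks one time step later and at a lower height, which is the kind of mismatch that shows up as an angular offset and a radial gap in a polar overlay.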

Generating Fingerprint Data

make_fingerprint_data() creates a layer × feature matrix of importances with optional sparsity and structure, for plot_feature_fingerprint(). This supports comparisons of feature influence profiles across models or periods—an interpretability aid complementary to verification metrics [2].

from kdiagram.datasets import make_fingerprint_data

# Get as DataFrame
fp_df = make_fingerprint_data(
    n_layers=3, n_features=5, seed=303, as_frame=True,
    sparsity=0.2, add_structure=True
)

print("--- Fingerprint Data Frame ---")
print(fp_df)

Example Output
--- Fingerprint Data Frame ---
           Feature_1  Feature_2  Feature_3  Feature_4  Feature_5
Layer_A     0.941006   0.000000   0.000000   0.000000   0.000000
Layer_B     0.130220   0.870414   0.456472   0.769115   0.322668
Layer_C     0.391512   0.139630   1.022977   0.000000   0.000000
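A common preprocessing step before comparing importance profiles is to scale each layer by its own maximum, so that the shapes rather than the magnitudes are compared. The sketch below applies this to the matrix printed above; whether plot_feature_fingerprint() normalizes in exactly this way is not stated here, so treat the helper as an illustrative assumption.

```python
features = ["Feature_1", "Feature_2", "Feature_3", "Feature_4", "Feature_5"]

# Values copied from the example output above.
importances = {
    "Layer_A": [0.941006, 0.000000, 0.000000, 0.000000, 0.000000],
    "Layer_B": [0.130220, 0.870414, 0.456472, 0.769115, 0.322668],
    "Layer_C": [0.391512, 0.139630, 1.022977, 0.000000, 0.000000],
}


def normalize_row(values):
    """Scale a row so that its largest entry equals 1 (all-zero rows stay zero)."""
    top = max(values)
    return [v / top if top > 0 else 0.0 for v in values]


normalized = {layer: normalize_row(row) for layer, row in importances.items()}
dominant = {layer: features[row.index(max(row))] for layer, row in importances.items()}

for layer in importances:
    print(layer, dominant[layer], [round(v, 2) for v in normalized[layer]])
```

Each layer ends up with a different dominant feature, which is the kind of contrast a fingerprint plot is meant to expose.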

Integrated Plotting Example

This example shows how to generate a dataset using a load_ or make_ function (requesting the DataFrame directly with as_frame=True) and immediately pass it to a relevant k-diagram plotting function. Here, we generate uncertainty data and create an anomaly magnitude plot.

import kdiagram as kd
import matplotlib.pyplot as plt

# 1. Generate data as DataFrame
df = kd.datasets.load_uncertainty_data(
    as_frame=True,
    n_samples=200,
    n_periods=1,       # Only need first period for this plot
    anomaly_frac=0.2,  # Ensure anomalies exist
    prefix="flow",
    start_year=2024,
    seed=99
)

# 2. Create the plot using the generated DataFrame
ax = kd.plot_anomaly_magnitude(
    df=df,
    actual_col='flow_actual',
    q_cols=['flow_2024_q0.1', 'flow_2024_q0.9'],
    title="Anomaly Magnitude on Generated Data",
    cbar=True,
    savefig="../images/dataset_plot_example_anomaly.png"
)
plt.close()  # Close plot after saving

Example plot generated from dataset function
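One natural way to quantify an interval failure is the distance by which the observation escapes the interval. The sketch below implements that definition on made-up values; it is an assumption about what "anomaly magnitude" means, not a confirmed description of how plot_anomaly_magnitude() computes it, and the helper name is hypothetical.

```python
def anomaly_magnitude(y, q_low, q_up):
    """Distance outside [q_low, q_up]; zero when the interval covers y."""
    if y < q_low:
        return q_low - y   # under-prediction: truth below the lower bound
    if y > q_up:
        return y - q_up    # over-prediction: truth above the upper bound
    return 0.0


# Made-up (actual, Q10, Q90) triples.
samples = [
    (10.0, 8.0, 12.0),   # covered
    (5.0, 8.0, 12.0),    # 3 units below the lower bound
    (14.5, 8.0, 12.0),   # 2.5 units above the upper bound
]

magnitudes = [anomaly_magnitude(y, lo, hi) for y, lo, hi in samples]
print(magnitudes)
```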

Generating Taylor Data and Plotting

This example generates data suitable for Taylor diagrams using make_taylor_data() and plots it using plot_taylor_diagram(). The data is retrieved as a Bunch object, and relevant attributes are passed to the plot function.

import kdiagram as kd
import matplotlib.pyplot as plt

# 1. Generate data as Bunch object
taylor_data = kd.datasets.make_taylor_data(
    n_models=4,
    n_samples=150,
    seed=101,
    corr_range=(0.6, 0.98),
    std_range=(0.8, 1.2)
)

# 2. Create the plot using data from the Bunch
ax = kd.plot_taylor_diagram(
    *taylor_data.predictions,  # Unpack list of prediction arrays
    reference=taylor_data.reference,
    names=taylor_data.model_names,
    title="Taylor Diagram on Generated Data",
    acov='half_circle',
    # Save the plot
    savefig="../images/dataset_plot_example_taylor.png"
)
plt.close()  # Close plot after saving

Example Taylor Diagram generated from dataset function

Generating Fingerprint Data and Plotting

This example uses make_fingerprint_data() to generate a feature importance matrix (returned directly as a DataFrame using as_frame=True) and visualizes it with plot_feature_fingerprint().

import kdiagram as kd
import matplotlib.pyplot as plt

# 1. Generate data as DataFrame
fp_df = kd.datasets.make_fingerprint_data(
    n_layers=4,
    n_features=7,
    layer_names=['SVM', 'RF', 'MLP', 'XGB'],
    feature_names=['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7'],
    seed=303,
    as_frame=True,  # Get DataFrame directly
)

# 2. Create the plot using the generated DataFrame
# plot_feature_fingerprint takes the importance matrix (df/array),
# features (list/df.columns), and labels (list/df.index)
ax = kd.plot_feature_fingerprint(
    importances=fp_df,                # Pass DataFrame directly
    features=fp_df.columns.tolist(),  # Get features from columns
    labels=fp_df.index.tolist(),      # Get labels from index
    title="Feature Fingerprint on Generated Data",
    fill=True,
    cmap='Accent',
    # Save the plot
    savefig="../images/dataset_plot_example_fingerprint.png"
)
plt.close()  # Close plot after saving

Example Feature Fingerprint plot generated from dataset function

Generating Cyclical Data and Plotting Relationship

This example generates data with cyclical patterns using make_cyclical_data() (as a DataFrame) and then plots the relationship between the true values (mapped to angle) and the normalized predictions (mapped to radius) using plot_relationship().

import kdiagram as kd
import matplotlib.pyplot as plt
import numpy as np

# 1. Generate cyclical data as DataFrame
cycle_df = kd.datasets.make_cyclical_data(
    n_samples=365,  # Simulate daily data for a year
    n_series=2,
    cycle_period=365,
    pred_bias=[0.5, -0.5],
    pred_phase_shift=[0, np.pi / 12],  # Second model lags slightly
    seed=404,
    as_frame=True  # Get DataFrame directly
)

# 2. Create the plot using the generated DataFrame
ax = kd.plot_relationship(
    cycle_df['y_true'],
    cycle_df['model_A'],  # Access generated prediction columns
    cycle_df['model_B'],
    names=['Model A', 'Model B'],  # Use generated names
    title="Relationship Plot on Generated Cyclical Data",
    theta_scale='uniform',  # Use uniform angle spacing (like time steps)
    acov='default',         # Full circle
    s=15, alpha=0.6,
    # Save the plot
    savefig="../images/dataset_plot_example_cyclical.png"
)
plt.close()  # Close plot after saving

Example Relationship plot generated from cyclical dataset function

Loading Uncertainty Data for Model Drift Plot

This example generates synthetic multi-period data using load_uncertainty_data() (returned as a Bunch object) and visualizes the uncertainty drift across horizons using plot_model_drift(). The Bunch object makes accessing the required column lists straightforward.

import kdiagram as kd
import matplotlib.pyplot as plt

# 1. Generate data as Bunch object
# Generate 5 periods for a clearer drift visual
data = kd.datasets.load_uncertainty_data(
    as_frame=False,  # Get Bunch object
    n_samples=100,
    n_periods=5,
    prefix='drift_val',
    start_year=2020,
    interval_width_trend=0.8,  # Make width increase over time
    seed=50
)

# 2. Prepare arguments for the plot function from Bunch attributes
# Ensure horizon labels match the generated periods
horizons = [str(data.start_year + i) for i in range(data.n_periods)]

# 3. Create the plot using the generated data and extracted info
ax = kd.plot_model_drift(
    df=data.frame,           # The DataFrame within the Bunch
    q10_cols=data.q10_cols,  # List of Q10 columns from Bunch
    q90_cols=data.q90_cols,  # List of Q90 columns from Bunch
    horizons=horizons,       # Generated horizon labels
    title="Model Drift on Generated Data",
    acov='quarter_circle',
    # Save the plot
    savefig="../images/dataset_plot_example_drift.png"
)
plt.close()  # Close plot after saving

Example Model Drift plot generated from dataset function
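The quantity behind a drift view can be approximated by the mean interval width per horizon. The sketch below computes it from paired Q10/Q90 column lists on made-up rows; it illustrates the idea only, and plot_model_drift() may aggregate differently. The helper name and the `v_` column prefix are hypothetical.

```python
# Made-up rows using the <prefix>_<year>_q<level> naming convention.
rows = [
    {"v_2020_q0.1": 9.0, "v_2020_q0.9": 11.0, "v_2021_q0.1": 8.5, "v_2021_q0.9": 12.0},
    {"v_2020_q0.1": 4.0, "v_2020_q0.9": 7.0, "v_2021_q0.1": 3.0, "v_2021_q0.9": 8.0},
]
q10_cols = ["v_2020_q0.1", "v_2021_q0.1"]
q90_cols = ["v_2020_q0.9", "v_2021_q0.9"]
horizons = ["2020", "2021"]


def mean_width_per_horizon(rows, q10_cols, q90_cols):
    """Average Q90 - Q10 over all rows, one value per (q10, q90) column pair."""
    return [
        sum(row[hi] - row[lo] for row in rows) / len(rows)
        for lo, hi in zip(q10_cols, q90_cols)
    ]


mean_widths = mean_width_per_horizon(rows, q10_cols, q90_cols)
for horizon, width in zip(horizons, mean_widths):
    print(horizon, width)
```

A mean width that grows from one horizon to the next, as in this toy data, is the pattern that a positive interval_width_trend is meant to produce.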

Zhongshan Data: Interval Consistency Plot (Half Circle)

Load Zhongshan data (as Bunch) and plot interval consistency (using coefficient of variation for radius) restricted to a 180-degree view.

import warnings

import kdiagram as kd
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore", message=".*already exists.*")
ax = None
try:
    # 1. Load data as Bunch
    data = kd.datasets.load_zhongshan_subsidence(
        as_frame=False, download_if_missing=True
    )

    # 2. Check data
    if (data is not None and hasattr(data, 'frame')
            and data.q10_cols and data.q50_cols and data.q90_cols):
        print("Plotting interval consistency for Zhongshan.")

        # 3. Create the Interval Consistency plot
        ax = kd.plot_interval_consistency(
            df=data.frame,
            qlow_cols=data.q10_cols,
            qup_cols=data.q90_cols,
            q50_cols=data.q50_cols,  # Use Q50 for color context
            use_cv=True,             # Use Coefficient of Variation
            acov='half_circle',      # Use 180 degree view
            title="Zhongshan Interval Consistency (CV, 180°)",
            cmap='Purples',
            s=15, alpha=0.7,
            # Save the plot
            savefig="../images/dataset_plot_example_zhongshan_consistency_half.png"
        )
        plt.close()
    else:
        print("Loaded data object missing required attributes.")

except FileNotFoundError as e:
    print(f"ERROR - Zhongshan data not found: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

if ax is None:
    print("Plot generation skipped.")

Example Interval Consistency plot using Zhongshan data (180 deg)
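The coefficient of variation used for the radius can be sketched as the standard deviation of a location's interval widths across periods divided by their mean. The helper below is hypothetical; in particular, the use of the population standard deviation is an assumption, since the estimator used by plot_interval_consistency() is not stated here.

```python
import statistics


def width_cv(q_low, q_up):
    """Coefficient of variation of interval widths across periods for one location."""
    widths = [hi - lo for lo, hi in zip(q_low, q_up)]
    mean_width = statistics.fmean(widths)
    if mean_width == 0:
        return 0.0  # degenerate intervals: treat as perfectly consistent
    return statistics.pstdev(widths) / mean_width


# Made-up bounds over three periods for two locations.
steady_cv = width_cv([8.0, 9.0, 10.0], [10.0, 11.0, 12.0])   # widths 2, 2, 2
erratic_cv = width_cv([8.0, 8.0, 8.0], [9.0, 11.0, 13.0])    # widths 1, 3, 5

print(round(steady_cv, 4), round(erratic_cv, 4))
```

A location whose widths never change has a CV of zero, while widths that swing between periods push the CV up; dividing by the mean makes locations with different overall interval sizes comparable.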

References