.. _userguide_datasets:

============
Datasets 
============

The :mod:`kdiagram.datasets` module provides convenient functions
to access sample datasets included with the package (like the
Zhongshan subsidence data) and to generate various synthetic datasets
on the fly.

These datasets are invaluable for:

* Running examples provided in the documentation and gallery.
* Testing `k-diagram`'s plotting functions with predictable data structures.
* Exploring different scenarios of uncertainty, drift, or model comparison.

Most functions allow you to retrieve data either as a standard
:class:`pandas.DataFrame` or as a :class:`~kdiagram.bunch.Bunch` object
(using the ``as_frame`` parameter). The Bunch object conveniently packages
the DataFrame along with metadata like feature/target names, relevant
column lists, and a description of the dataset's origin or generation
parameters.

Function Summary
------------------

.. list-table:: Dataset Loading and Generation Functions
   :widths: 35 65
   :header-rows: 1

   * - Function
     - Description
   * - :func:`~kdiagram.datasets.load_uncertainty_data`
     - Generates synthetic multi-period quantile data with trends,
       noise, and anomalies. Ideal for drift/consistency plots.
   * - :func:`~kdiagram.datasets.load_zhongshan_subsidence`
     - Loads the included Zhongshan subsidence prediction sample dataset.
   * - :func:`~kdiagram.datasets.make_taylor_data`
     - Generates a reference series and multiple prediction series with
       controlled correlation/standard deviation for Taylor Diagrams.
   * - :func:`~kdiagram.datasets.make_multi_model_quantile_data`
     - Generates quantile predictions from multiple simulated models
       for a single time period. Useful for model comparison plots.
   * - :func:`~kdiagram.datasets.make_cyclical_data`
     - Generates data with true and predicted series exhibiting
       cyclical/seasonal patterns.
   * - :func:`~kdiagram.datasets.make_fingerprint_data`
     - Generates a synthetic feature importance matrix for feature
       fingerprint (radar) plots.

.. raw:: html

    <hr>
    
Usage Examples
----------------

Below are examples demonstrating how to use each function.

Loading Synthetic Uncertainty Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generates multi-period quantile data, returned as a Bunch object
by default.

.. code-block:: python
   :linenos:

   from kdiagram.datasets import load_uncertainty_data

   # Generate as Bunch (default)
   data_bunch = load_uncertainty_data(
       n_samples=10, n_periods=2, seed=1, prefix="flow"
       )

   print("--- Bunch Object ---")
   print(f"Keys: {list(data_bunch.keys())}")
   print(f"Description:\n{data_bunch.DESCR[:200]}...") # Print start of DESCR
   print("\nDataFrame Head:")
   print(data_bunch.frame.head(3))
   print("\nQ10 Columns:")
   print(data_bunch.q10_cols)

.. code-block:: text
   :caption: Example Output (Structure)

   --- Bunch Object ---
   Keys: ['frame', 'feature_names', 'target_names', 'target', 'quantile_cols', 'q10_cols', 'q50_cols', 'q90_cols', 'n_periods', 'prefix', 'start_year', 'DESCR']
   Description:
   Synthetic Multi-Period Uncertainty Dataset for k-diagram

   **Description:**
   Generates synthetic data simulating quantile forecasts (Q10,
   Q50, Q90) for 'flow' over 2 periods starting
   from 2022 across 10 samples/lo...

   DataFrame Head:
      location_id  longitude   latitude   elevation  flow_actual  ...
   0            0 -116.8388    35.094262  366.807627    16.816179  ...
   1            1 -117.8696    34.045590  247.216119     9.508103  ...
   2            2 -119.749534  35.488999  353.628218     5.439137  ...

   Q10 Columns:
   ['flow_2022_q0.1', 'flow_2023_q0.1']


Loading Zhongshan Subsidence Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Loads the packaged sample dataset. This example loads it as a
DataFrame and selects only data for specific years and quantiles.

.. code-block:: python
   :linenos:

   from kdiagram.datasets import load_zhongshan_subsidence
   import warnings

   # Suppress potential download warnings if data exists locally
   warnings.filterwarnings("ignore", message=".*already exists.*")

   # Load as DataFrame, subsetting years and quantiles
   try:
       df_zhongshan_subset = load_zhongshan_subsidence(
           as_frame=True,
           years=[2023, 2025],
           quantiles=[0.1, 0.9],
           include_target=False, # Exclude 'subsidence_YYYY' cols
           download_if_missing=True # Allow download if not packaged/cached
       )
       print("Loaded Zhongshan Subset DataFrame:")
       print(df_zhongshan_subset.head(3))
       print("\nColumns:")
       print(df_zhongshan_subset.columns)

   except FileNotFoundError as e:
       print(f"Error loading Zhongshan data: {e}")
       print("Ensure the package data was installed correctly or "
             "download is enabled/possible.")
   except Exception as e:
        print(f"An unexpected error occurred: {e}")

.. code-block:: text
   :caption: Example Output (Structure, assuming load successful)

   Loaded Zhongshan Subset DataFrame:
        longitude   latitude  subsidence_2023_q0.1  subsidence_2023_q0.9  subsidence_2025_q0.1  subsidence_2025_q0.9
   0   113.237984  22.494591              ...              ...              ...              ...
   1   113.220802  22.513592              ...              ...              ...              ...
   2   113.225632  22.530231              ...              ...              ...              ...

   Columns:
   Index(['longitude', 'latitude', 'subsidence_2023_q0.1',
          'subsidence_2023_q0.9', 'subsidence_2025_q0.1',
          'subsidence_2025_q0.9'], dtype='object')


Generating Taylor Diagram Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Uses :func:`~kdiagram.datasets.make_taylor_data` to generate a
reference series and multiple prediction series suitable for Taylor
diagrams. Returns a Bunch containing arrays and calculated stats.

.. code-block:: python
   :linenos:

   from kdiagram.datasets import make_taylor_data

   taylor_data = make_taylor_data(n_models=2, n_samples=50, seed=101)

   print("--- Taylor Data Bunch ---")
   print(f"Reference shape: {taylor_data.reference.shape}")
   print(f"Number of prediction series: {len(taylor_data.predictions)}")
   print(f"Prediction shapes: {[p.shape for p in taylor_data.predictions]}")
   print("\nCalculated Stats:")
   print(taylor_data.stats)
   print(f"\nActual Reference Std Dev: {taylor_data.ref_std:.4f}")

.. code-block:: text
   :caption: Example Output

   --- Taylor Data Bunch ---
   Reference shape: (50,)
   Number of prediction series: 2
   Prediction shapes: [(50,), (50,)]

   Calculated Stats:
              stddev  corrcoef
   Model_A  0.729855  0.835114
   Model_B  1.029889  0.508220

   Actual Reference Std Dev: 0.9404


Generating Multi-Model Quantile Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Uses :func:`~kdiagram.datasets.make_multi_model_quantile_data` to
simulate quantile predictions from different models for the same
target variable.

.. code-block:: python
   :linenos:

   from kdiagram.datasets import make_multi_model_quantile_data

   # Get as DataFrame
   df_multi_model = make_multi_model_quantile_data(
       n_samples=5, n_models=2, seed=5, as_frame=True,
       quantiles=[0.1, 0.5, 0.9]
   )

   print("--- Multi-Model Quantile DataFrame ---")
   print(df_multi_model)

.. code-block:: text
   :caption: Example Output

   --- Multi-Model Quantile DataFrame ---
      y_true  feature_1  feature_2  pred_Model_A_q0.1  pred_Model_A_q0.5  pred_Model_A_q0.9  pred_Model_B_q0.1  pred_Model_B_q0.5  pred_Model_B_q0.9
   0  50.853502   0.533165   5.108194          43.514661          49.740457          54.158097          36.189075          46.430960          58.077600
   1  46.300911   0.639037   1.962088          41.607881          45.545123          51.889254          35.546803          41.932122          51.628643
   2  44.874897   0.138801   5.689870          42.241030          44.652911          49.972431          37.209904          42.587300          50.182159
   3  52.396877   0.948104   2.990119          45.163347          52.437158          57.719859          45.359873          54.715327          60.382700
   4  53.938741   0.776598   5.808982          43.275494          53.397751          61.104506          39.947971          52.309521          63.340564


Generating Cyclical Data
~~~~~~~~~~~~~~~~~~~~~~~~~~
Uses :func:`~kdiagram.datasets.make_cyclical_data` to create time
series with seasonal or cyclical patterns, useful for visualizing
relationships where angle represents phase.

.. code-block:: python
   :linenos:

   from kdiagram.datasets import make_cyclical_data

   # Get as Bunch
   cycle_bunch = make_cyclical_data(
       n_samples=12, n_series=1, cycle_period=12, seed=5,
       amplitude_true=5, offset_true=10
   )

   print("--- Cyclical Data Bunch ---")
   print(f"Frame shape: {cycle_bunch.frame.shape}")
   print(f"Series names: {cycle_bunch.series_names}")
   print(cycle_bunch.frame[['time_step', 'y_true', 'model_A']].head())

.. code-block:: text
   :caption: Example Output

   --- Cyclical Data Bunch ---
   Frame shape: (12, 3)
   Series names: ['model_A']
      time_step     y_true    model_A
   0          0   9.830655   9.801473
   1          1  14.369168  14.775036
   2          2  14.989960  15.554347
   3          3   9.668771  10.262745
   4          4   4.783064   5.812793


Generating Fingerprint Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Uses :func:`~kdiagram.datasets.make_fingerprint_data` to generate
a matrix of feature importances across multiple layers, suitable
for :func:`~kdiagram.plot.feature_based.plot_feature_fingerprint`.

.. code-block:: python
   :linenos:

   from kdiagram.datasets import make_fingerprint_data

   # Get as DataFrame
   fp_df = make_fingerprint_data(
       n_layers=3, n_features=5, seed=303, as_frame=True,
       sparsity=0.2, add_structure=True
   )

   print("--- Fingerprint Data Frame ---")
   print(fp_df)

.. code-block:: text
   :caption: Example Output

   --- Fingerprint Data Frame ---
              Feature_1  Feature_2  Feature_3  Feature_4  Feature_5
   Layer_A     0.941006   0.000000   0.000000   0.000000   0.000000
   Layer_B     0.130220   0.870414   0.456472   0.769115   0.322668
   Layer_C     0.391512   0.139630   1.022977   0.000000   0.000000

.. raw:: html

    <hr>
    
Integrated Plotting Example
------------------------------

This example shows how to generate a dataset using a `load_` or
`make_` function (requesting the DataFrame directly with
``as_frame=True``) and immediately pass it to a relevant `k-diagram`
plotting function. Here, we generate uncertainty data and create an
anomaly magnitude plot.

.. code-block:: python
   :linenos:

   import kdiagram as kd 
   import matplotlib.pyplot as plt

   # 1. Generate data as DataFrame
   df = kd.datasets.load_uncertainty_data(
       as_frame=True,
       n_samples=200,
       n_periods=1, # Only need first period for this plot
       anomaly_frac=0.2, # Ensure anomalies exist
       prefix="flow",
       start_year=2024,
       seed=99
   )

   # 2. Create the plot using the generated DataFrame
   ax = kd.plot_anomaly_magnitude(
       df=df,
       actual_col='flow_actual',
       q_cols=['flow_2024_q0.1', 'flow_2024_q0.9'],
       title="Anomaly Magnitude on Generated Data",
       cbar=True,
       savefig="../images/dataset_plot_example_anomaly.png"
   )
   plt.close() # Close plot after saving

.. image:: ../images/dataset_plot_example_anomaly.png
   :alt: Example plot generated from dataset function
   :align: center
   :width: 75%

.. topic:: 🧠 Analysis and Interpretation
   :class: hint

   This **Anomaly Magnitude Plot** visualizes the errors from the
   synthetic dataset generated by
   :func:`~kdiagram.datasets.load_uncertainty_data`. Only points where
   the 'actual' value falls outside the [Q10, Q90] interval are shown.

   **Analysis and Interpretation:**

   * **Angle (θ):** Represents the index of the generated sample
     (0 to 199), distributed around the circle.
   * **Radius (r):** Shows the **magnitude** of the anomaly – how far
     the ``flow_actual`` value was from the closest bound
     (``flow_2024_q0.1`` or ``flow_2024_q0.9``). Larger radii indicate
     more severe prediction interval failures.
   * **Color:** Distinguishes between **under-predictions** (actual < Q10,
     shown in blues by default and in the legend) and
     **over-predictions** (actual > Q90, shown in reds by default and
     in the legend). The **intensity** of the color, indicated by the
     colorbar, also reflects the anomaly magnitude (radius).

   **🔍 Key Insights from this Example:**

   * The presence of both blue and red points confirms that the
     data generation process successfully created both under- and
     over-prediction anomalies as requested by ``anomaly_frac=0.2``.
   * The points are scattered across various angles, suggesting the
     anomalies were introduced randomly across the samples, without a
     strong angular (index-based) pattern in this synthetic dataset.
   * The radii vary, with some points near the center (small anomaly
     magnitude) and others further out (larger magnitude, up to ~8
     units according to the color bar), indicating a range of error
     severities was generated.

   **💡 Connection to Data Generation:**

   * ``n_samples=200`` created 200 potential points around the circle.
   * ``anomaly_frac=0.2`` aimed to make ~40 points appear as anomalies.
   * ``prefix="flow"`` and ``start_year=2024`` determined the column
     names (`flow_actual`, `flow_2024_q0.1`, `flow_2024_q0.9`)
     required by the plotting function call.
   * The range of radii (anomaly magnitudes) seen reflects the random
     deviations introduced during the synthetic anomaly generation step
     within the ``load_uncertainty_data`` function.
     
.. raw:: html

    <hr>
     
Generating Taylor Data and Plotting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example generates data suitable for Taylor diagrams using
:func:`~kdiagram.datasets.make_taylor_data` and plots it using
:func:`~kdiagram.plot.evaluation.plot_taylor_diagram`. The data is
retrieved as a Bunch object, and relevant attributes are passed to the
plot function.

.. code-block:: python
   :linenos:

   import kdiagram as kd 
   import matplotlib.pyplot as plt

   # 1. Generate data as Bunch object
   taylor_data = kd.datasets.make_taylor_data(
       n_models=4,
       n_samples=150,
       seed=101,
       corr_range=(0.6, 0.98),
       std_range=(0.8, 1.2)
   )

   # 2. Create the plot using data from the Bunch
   # Assuming plot function is kd.plot_taylor_diagram
   ax = kd.plot_taylor_diagram(
       *taylor_data.predictions, # Unpack list of prediction arrays
       reference=taylor_data.reference,
       names=taylor_data.model_names,
       title="Taylor Diagram on Generated Data",
       acov='half_circle',
       # Save the plot
       savefig="../images/dataset_plot_example_taylor.png"
   )
   plt.close() # Close plot after saving

.. image:: ../images/dataset_plot_example_taylor.png
   :alt: Example Taylor Diagram generated from dataset function
   :align: center
   :width: 75%


.. topic:: 🧠 Analysis and Interpretation
   :class: hint

   This example first uses
   :func:`~kdiagram.datasets.make_taylor_data` to generate a
   reference dataset and four simulated model prediction datasets
   with varying statistical properties. It then visualizes these
   using :func:`~kdiagram.plot.evaluation.plot_taylor_diagram`.

   **Analysis and Interpretation:**

   * **Axes & Reference:** The plot displays standard deviation as the
     radial distance from the origin and correlation as the angle
     (decreasing clockwise from the left 'W' axis, where Corr=1.0).
     The red arc represents the standard deviation of the reference
     data (which is approximately 1.0).
   * **Model Performance:** Each colored dot represents a model:
      * **Model A (Red):** High correlation (~0.9) and standard
        deviation slightly less than the reference (~0.9). It captures
        the pattern well but slightly underestimates variability.
      * **Model B (Purple):** Lower correlation (~0.7) and much higher
        standard deviation (~1.3). It matches the pattern less well
        and overestimates variability.
      * **Model C (Brown):** Good correlation (~0.8) but lower
        standard deviation (~0.8). Captures the pattern reasonably
        but underestimates variability.
      * **Model D (Grey):** Similar correlation to Model B (~0.75) but
        lower standard deviation (~0.85), closer to Model A/C in
        variability.
   * **Overall Skill (RMSD):** The distance from each model point to
     the reference point on the arc (at Corr=1.0, StdDev=1.0)
     indicates the centered RMS difference. Model C appears closest,
     followed perhaps by Model A, suggesting they have the best
     overall balance in this simulation. Model B is clearly the
     furthest (worst RMSD).

   **💡 Connection to Data Generation:**

   * The spread of points reflects the target ranges set in
     `make_taylor_data`: ``corr_range=(0.6, 0.98)`` and
     ``std_range=(0.8, 1.2)``. The function successfully generated
     models whose actual statistics fall within or near these target
     ranges relative to the reference standard deviation of ~1.0.
   * This demonstrates how the generation function can create diverse
     scenarios for testing how different models might appear on a
     Taylor Diagram.
     
.. raw:: html

    <hr>
        
Generating Fingerprint Data and Plotting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example uses :func:`~kdiagram.datasets.make_fingerprint_data`
to generate a feature importance matrix (returned directly as a
DataFrame using ``as_frame=True``) and visualizes it with
:func:`~kdiagram.plot.feature_based.plot_feature_fingerprint`.

.. code-block:: python
   :linenos:

   import kdiagram as kd 
   import matplotlib.pyplot as plt

   # 1. Generate data as DataFrame
   fp_df = kd.datasets.make_fingerprint_data(
       n_layers=4,
       n_features=7,
       layer_names=['SVM', 'RF', 'MLP', 'XGB'],
       feature_names=['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7'],
       seed=303,
       as_frame=True, # Get DataFrame directly
   )

   # 2. Create the plot using the generated DataFrame
   # plot_feature_fingerprint takes the importance matrix (df/array),
   # features (list/df.columns), and labels (list/df.index)
   ax = kd.plot_feature_fingerprint(
       importances=fp_df, # Pass DataFrame directly
       features=fp_df.columns.tolist(), # Get features from columns
       labels=fp_df.index.tolist(),     # Get labels from index
       title="Feature Fingerprint on Generated Data",
       fill=True,
       cmap='Accent',
       # Save the plot
       savefig="../images/dataset_plot_example_fingerprint.png"
   )
   plt.close() # Close plot after saving

.. image:: ../images/dataset_plot_example_fingerprint.png
   :alt: Example Feature Fingerprint plot generated from dataset function
   :align: center
   :width: 75%

.. topic:: 🧠 Analysis and Interpretation
   :class: hint

   This **Feature Importance Fingerprint** plot uses a radar chart
   to compare the importance profiles of 7 features (F1-F7) across
   4 simulated models (SVM, RF, MLP, XGB), generated using
   :func:`~kdiagram.datasets.make_fingerprint_data`.

   **Analysis and Interpretation:**

   * **Axes:** Each axis radiating from the center corresponds to one
     of the features (F1 through F7).
   * **Polygons (Layers):** Each colored, filled polygon represents
     one model, as indicated by the legend.
   * **Radius (Normalized Importance):** The distance from the center
     along a feature's axis indicates the *relative importance* of
     that feature *for that specific model*. Since normalization is
     applied per model (the default ``normalize=True`` was used here),
     the radius scales from 0 to 1 (maximum importance *for that model*).
   * **Shape ("Fingerprint"):** The overall shape of each polygon
     provides a distinct "fingerprint", showing which features are
     most influential for each model relative to its own other features.

   **🔍 Key Insights from this Example:**

   * **Distinct Profiles:** Each model clearly relies on different
     features. For instance:
        * **SVM (Green):** Primarily driven by F3, with some
          contribution from F1 and F2.
        * **RF (Orange):** Shows high relative importance for F1 and
          F6, moderate for F2.
        * **MLP (Blue):** Relies most heavily on F3 and F5.
        * **XGB (Brown):** Dominated by F4, with moderate importance
          for F2, F3, and F5.
   * **Feature Comparison:** We can compare feature relevance *across*
     models. F3 is important for SVM, MLP, and XGB, but not RF. F7
     appears relatively unimportant for all models shown. F1 is crucial
     for RF but less so for others.
   * **Normalization Effect:** Because normalization was used, we are
     comparing the *patterns* of importance. We cannot directly compare
     the absolute importance score of F3 for SVM vs. F3 for MLP from
     this plot alone (use ``normalize=False`` for that).

   **💡 Connection to Data Generation:**

   * The number of axes (7) and polygons (4) match the `n_features`
     and `n_layers` parameters passed to `make_fingerprint_data`.
   * The distinct shapes reflect the `add_structure=True` (default)
     setting in the generator, which aims to make fingerprints differ.
   * The radius scaling to 1.0 for each polygon's maximum point is due
     to `normalize=True` being active.

.. raw:: html

    <hr>


Generating Cyclical Data and Plotting Relationship
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example generates data with cyclical patterns using
:func:`~kdiagram.datasets.make_cyclical_data` (as a DataFrame) and
then plots the relationship between the true values (mapped to angle)
and the normalized predictions (mapped to radius) using
:func:`~kdiagram.plot.relationship.plot_relationship`.

.. code-block:: python
   :linenos:

   import kdiagram as kd # Assuming top-level access or specific imports
   import matplotlib.pyplot as plt
   import numpy as np 

   # 1. Generate cyclical data as DataFrame
   cycle_df = kd.datasets.make_cyclical_data(
       n_samples=365, # Simulate daily data for a year
       n_series=2,
       cycle_period=365,
       pred_bias=[0.5, -0.5],
       pred_phase_shift=[0, np.pi / 12], # Second model lags slightly
       seed=404,
       as_frame=True # Get DataFrame directly
   )

   # 2. Create the plot using the generated DataFrame
   ax = kd.plot_relationship(
       cycle_df['y_true'],
       cycle_df['model_A'], # Access generated prediction columns
       cycle_df['model_B'],
       names=['Model A', 'Model B'], # Use generated names
       title="Relationship Plot on Generated Cyclical Data",
       theta_scale='uniform', # Use uniform angle spacing (like time steps)
       acov='default',      # Full circle
       s=15, alpha=0.6,
       # Save the plot
       savefig="../images/dataset_plot_example_cyclical.png"
   )
   plt.close() # Close plot after saving

.. image:: ../images/dataset_plot_example_cyclical.png
   :alt: Example Relationship plot generated from cyclical dataset function
   :align: center
   :width: 75%

.. topic:: 🧠 Analysis and Interpretation
   :class: hint

   This plot visualizes the relationship between a synthetically
   generated cyclical 'true' signal and predictions from two models
   (Model A, Model B), created using
   :func:`~kdiagram.datasets.make_cyclical_data`. The plot uses
   :func:`~kdiagram.plot.relationship.plot_relationship`.

   **Analysis and Interpretation:**

   * **Angle (θ):** Represents the **time step index** (0 to 364),
     distributed uniformly around the full 360 degrees because
     ``theta_scale='uniform'`` was used. It does *not* directly
     represent the magnitude of `y_true` in this case.
   * **Radius (r):** Represents the **normalized predicted value** for
     each model, scaled independently to the range [0, 1]. Radius=1
     corresponds to the maximum prediction *for that specific model*,
     and Radius=0 corresponds to its minimum prediction.
   * **Colors:** Distinguish Model A (blue-grey) from Model B
     (brown-orange).

   **🔍 Key Insights from this Example:**

   * **Cyclical Patterns:** Both models clearly exhibit cyclical
     behavior, forming distinct orbital patterns, reflecting the
     underlying sine wave generated by `make_cyclical_data`.
   * **Phase Shift:** Model B's pattern appears slightly rotated
     clockwise relative to Model A's pattern. This visualizes the
     `pred_phase_shift` introduced during data generation, where
     Model B was made to lag Model A.
   * **Normalization Effect:** The radial positions show the relative
     level of each prediction *within its own range*. We can compare
     if Model A is at its peak (radius near 1) at the same time step
     (angle) as Model B is at its peak.
   * **Bias Effect:** The slight difference in the average radial
     position between the two models might reflect the different
     `pred_bias` values applied during generation.

   **💡 When to Use:**

   * **Visualize Cyclical Relationships:** Ideal when `y_true` (or the
     variable mapped to angle) represents a cyclical process like
     time of day, day of year, or phase angle.
   * **Compare Normalized Model Responses:** Useful for comparing the
     *relative* pattern or timing of different model predictions over
     a cycle or sequence, even if their absolute scales differ, thanks
     to the independent radial normalization.
   * **Identify Lags/Leads:** Phase differences between prediction
     series become visually apparent as angular offsets.

.. raw:: html

    <hr>

Loading Uncertainty Data for Model Drift Plot
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example generates synthetic multi-period data using
:func:`~kdiagram.datasets.load_uncertainty_data` (returned as a Bunch
object) and visualizes the uncertainty drift across horizons using
:func:`~kdiagram.plot.uncertainty.plot_model_drift`. The Bunch object
makes accessing the required column lists straightforward.

.. code-block:: python
   :linenos:

   import kdiagram as kd # Assuming plots and datasets accessible
   import matplotlib.pyplot as plt

   # 1. Generate data as Bunch object
   # Generate 5 periods for a clearer drift visual
   data = kd.datasets.load_uncertainty_data(
       as_frame=False, # Get Bunch object
       n_samples=100,
       n_periods=5,
       prefix='drift_val',
       start_year=2020,
       interval_width_trend=0.8, # Make width increase over time
       seed=50
   )

   # 2. Prepare arguments for the plot function from Bunch attributes
   # Ensure horizon labels match the generated periods
   horizons = [str(data.start_year + i) for i in range(data.n_periods)]

   # 3. Create the plot using the generated data and extracted info
   ax = kd.plot_model_drift(
       df=data.frame,          # The DataFrame within the Bunch
       q10_cols=data.q10_cols, # List of Q10 columns from Bunch
       q90_cols=data.q90_cols, # List of Q90 columns from Bunch
       horizons=horizons,      # Generated horizon labels
       title="Model Drift on Generated Data",
       acov='quarter_circle',
       # Save the plot
       savefig="../images/dataset_plot_example_drift.png"
   )
   plt.close() # Close plot after saving

.. image:: ../images/dataset_plot_example_drift.png
   :alt: Example Model Drift plot generated from dataset function
   :align: center
   :width: 70%

.. topic:: 🧠 Analysis and Interpretation
   :class: hint

   This example uses :func:`~kdiagram.datasets.load_uncertainty_data`
   to generate synthetic data simulating increasing interval widths
   over 5 periods (2020-2024). The resulting DataFrame and column
   lists (extracted from the Bunch object) are then passed to
   :func:`~kdiagram.plot.uncertainty.plot_model_drift` to visualize
   this trend.

   **Analysis and Interpretation:**

   * **Plot Type:** A polar bar chart confined to a 90-degree arc
     (``acov='quarter_circle'``).
   * **Angle (θ):** Each position corresponds to a forecast horizon,
     labeled here with the years 2020 through 2024.
   * **Radius (r):** The length of each bar represents the **average
     prediction interval width** (mean of Q90 - Q10) calculated
     across all samples *for that specific year*.
   * **Color:** Bars are colored using the default `coolwarm` map,
     transitioning from cool (blue) for lower radial values to warm
     (red) for higher values.
   * **Annotations:** The number above each bar shows the calculated
     mean interval width for that horizon.

   **🔍 Key Insights from this Example:**

   * **Increasing Uncertainty:** The bars clearly get taller (larger
     radius) moving clockwise from 2020 to 2024. This visually
     confirms the positive **drift** in average uncertainty.
   * **Quantified Drift:** The annotations show the mean width
     increasing steadily from ~3.97 in 2020 to ~7.12 in 2024.
   * **Color Reinforcement:** The color shift from blue towards red
     also indicates the increasing magnitude of the average interval
     width across the horizons.

   **💡 Connection to Data Generation:**

   * The clear increase in bar height is a direct result of setting
     ``interval_width_trend=0.8`` when calling
     ``load_uncertainty_data``. This parameter caused the synthetic
     interval widths to widen, on average, for each subsequent period.
   * The labels 2020-2024 correspond correctly to ``start_year=2020``
     and ``n_periods=5``.
   * The use of the Bunch object simplified plotting by providing
     pre-parsed lists ``data.q10_cols`` and ``data.q90_cols``.

.. raw:: html

    <hr>

Zhongshan Data: Interval Consistency Plot (Half Circle)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Load Zhongshan data (as Bunch) and plot interval consistency (using
coefficient of variation for radius) restricted to a 180-degree view.

.. code-block:: python
   :linenos:

   import kdiagram as kd
   import matplotlib.pyplot as plt
   import warnings
   import pandas as pd

   warnings.filterwarnings("ignore", message=".*already exists.*")
   ax = None
   try:
       # 1. Load data as Bunch
       data = kd.datasets.load_zhongshan_subsidence(
           as_frame=False, download_if_missing=True
           )

       # 2. Check data
       if (data is not None and hasattr(data, 'frame')
               and data.q10_cols and data.q50_cols and data.q90_cols):
           print(f"Plotting interval consistency for Zhongshan.")

           # 3. Create the Interval Consistency plot
           ax = kd.plot_interval_consistency(
               df=data.frame,
               qlow_cols=data.q10_cols,
               qup_cols=data.q90_cols,
               q50_cols=data.q50_cols, # Use Q50 for color context
               use_cv=True,           # Use Coefficient of Variation
               acov='half_circle',    # <<< Use 180 degree view
               title="Zhongshan Interval Consistency (CV, 180°)",
               cmap='Purples',
               s=15, alpha=0.7,
               # Save the plot
               savefig="../images/dataset_plot_example_zhongshan_consistency_half.png"
           )
           plt.close()
       else:
           print("Loaded data object missing required attributes.")

   except FileNotFoundError as e:
       print(f"ERROR - Zhongshan data not found: {e}")
   except Exception as e:
       print(f"An unexpected error occurred: {e}")

   if ax is None: print("Plot generation skipped.")

.. image:: ../images/dataset_plot_example_zhongshan_consistency_half.png
   :alt: Example Interval Consistency plot using Zhongshan data (180 deg)
   :align: center
   :width: 75%

.. topic:: 🧠 Analysis and Interpretation
   :class: hint

   This plot uses
   :func:`~kdiagram.plot.uncertainty.plot_interval_consistency`
   to show the **stability of prediction interval widths** (Q90-Q10)
   over time (2022-2026) for the Zhongshan sample dataset. The
   angular coverage is set to 180 degrees (``acov='half_circle'``).

   **Analysis and Interpretation:**

   * **Angle (θ):** Represents the sample index (location 0-897),
     mapped linearly onto the top half of the circle (0° to 180°).
   * **Radius (r):** Shows the **Coefficient of Variation (CV)** of
     the interval width across the years for each location. A higher
     radius signifies greater *relative* inconsistency in the
     predicted uncertainty width over time.
   * **Color:** Represents the **average Q50** (median subsidence
     prediction) across all years for each location, using the
     `Purples` colormap (lighter = lower avg Q50, darker = higher
     avg Q50), as shown by the color bar.

   **🔍 Key Insights from this Example:**

   * **Dominant Consistency:** Similar to the previous consistency
     plot (which used a narrower angle), the overwhelming majority
     of locations cluster very close to the origin (radius near 0).
     This indicates **very high consistency** (low CV) in the
     predicted interval widths over the 5-year period for most
     sample points.
   * **Identified Outliers:** A small number of distinct outlier
     points are visible at much larger radii (CVs > 20), indicating
     locations where the model's uncertainty prediction is highly
     variable across the years relative to its average width.
   * **Color Context:** The dense cluster near the center mostly shows
     lighter purple shades, suggesting that the highly consistent
     predictions often correspond to areas with lower average Q50
     subsidence values. The few high-CV outliers show a mix of colors.
   * **Effect of `acov`:** Compared to an `eighth_circle`, the
     `half_circle` view displays roughly four times as many locations,
     confirming the pattern holds across a larger sample subset.

   **💡 Use Case Connection:**

   * This reinforces the finding that while the uncertainty estimate
     is stable for most locations in the sample, specific outlier
     locations exist where the model's uncertainty predictions are
     erratic over time and require scrutiny.
   * Decision-makers might trust the uncertainty bounds more in the
     low-CV cluster, especially where average predicted subsidence
     (color) is also low.

.. raw:: html

    <hr>