kdiagram.datasets.load_zhongshan_subsidence

kdiagram.datasets.load_zhongshan_subsidence(*, as_frame=False, years=None, quantiles=None, include_coords=True, include_target=True, data_home=None, download_if_missing=True, force_download=False)

Load the Zhongshan land subsidence prediction dataset.

This dataset contains sample multi-period quantile predictions (Q10, Q50, Q90 for 2022–2026) and simulated actual subsidence for 2022 and 2023, along with geographic coordinates for 898 locations in Zhongshan, China. It is intended for demonstrating and testing k-diagram’s uncertainty and evaluation plots and for reproducing examples related to spatiotemporal uncertainty diagnostics [1][2].

The function searches a local cache directory, bundled package resources, and optionally a remote repository (in that order). On success it returns either a pandas DataFrame or a Bunch with convenient attributes.

Parameters:
as_framebool, default=False

If False, return a Bunch that includes the filtered DataFrame plus metadata and sliced arrays (e.g., coordinates, target, and quantile columns). If True, return only the filtered DataFrame.

yearslist of int, optional

Subset to these calendar years (e.g., [2023, 2025]) when selecting target and quantile columns. If None, load all years found in the file (quantiles typically 2022–2026; targets typically 2022/2023).

quantileslist of float, optional

Subset to these quantile levels in [0, 1] (e.g., [0.1, 0.5, 0.9]). If None, load all quantiles detected for the selected years; the dataset provides Q10, Q50, and Q90, i.e. 0.1, 0.5, and 0.9.

include_coordsbool, default=True

If True, include coordinate columns 'longitude' and 'latitude' when present.

include_targetbool, default=True

If True, include base target columns (e.g., 'subsidence_2022', 'subsidence_2023') when present and consistent with the requested years.

data_homestr, optional

Directory path for caching datasets. If None, the path is resolved by get_data(). The cache root can also be configured via the KDIAGRAM_DATA environment variable. A typical default is ~/kdiagram_data.

download_if_missingbool, default=True

If True, attempt to download the dataset into the cache when it is found neither in the local cache nor in package resources.

force_downloadbool, default=False

If True, attempt to fetch a fresh copy even if a local file exists. Useful to refresh data during development.

Returns:
dataBunch or pandas.DataFrame

If as_frame=False (default), a Bunch with the following attributes:

  • frame : pandas DataFrame filtered by the request.

  • feature_names : list of included coordinate column names.

  • target_names : list of included target column names.

  • target : NumPy array of target values (or None).

  • longitude, latitude : NumPy arrays when coordinates are included.

  • quantile_cols : dict mapping keys like 'q0.1' to lists of matching column names.

  • q10_cols, q50_cols, q90_cols : convenience lists.

  • years_available, quantiles_available : lists detected in the original file.

  • start_year : smallest year in the loaded subset (if any).

  • n_periods : number of loaded years.

  • DESCR : human-readable dataset description.

If as_frame=True, only the filtered pandas DataFrame is returned.

Raises:
FileNotFoundError

When the dataset cannot be resolved from cache or package resources and either downloading is disabled or the download fails.

ValueError

If requested years or quantiles are invalid or not present in the data file.

Return type:

Bunch | DataFrame

See also

load_uncertainty_data

Generate a synthetic dataset with controllable anomalies and quantiles for testing visual diagnostics.

kdiagram.plot.uncertainty.plot_model_drift
kdiagram.plot.uncertainty.plot_uncertainty_drift
kdiagram.plot.uncertainty.plot_coverage_diagnostic
kdiagram.plot.uncertainty.plot_anomaly_magnitude

Example consumers of this dataset in documentation figures.

Notes

Search order. The loader resolves a file path using the following order: (1) local cache under data_home; (2) installed package resources; (3) optional remote download when download_if_missing=True. You can force step (3) with force_download=True.
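The search order can be illustrated with a small standalone function. This is only a sketch under stated assumptions, not k-diagram's implementation: `resolve_dataset_path`, its parameters, and the `fetch` callback are hypothetical names introduced here, and the package-resources step is modeled as a plain directory. The fallback to an existing local copy after a failed forced download is also an assumption.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_dataset_path(
    filename: str,
    cache_dir: Path,
    package_dir: Path,
    *,
    download_if_missing: bool = True,
    force_download: bool = False,
    fetch: Optional[Callable[[Path], None]] = None,
) -> Path:
    """Return a path to `filename`: cache first, then package data, then download.

    Hypothetical helper. `fetch(dest)` stands in for whatever writes the
    remote file to `dest`; it is not a k-diagram function.
    """
    cached = cache_dir / filename
    packaged = package_dir / filename

    if not force_download:
        # (1) local cache, then (2) files shipped with the package
        if cached.is_file():
            return cached
        if packaged.is_file():
            return packaged

    # (3) remote download, either forced or allowed as a fallback
    if (force_download or download_if_missing) and fetch is not None:
        cache_dir.mkdir(parents=True, exist_ok=True)
        try:
            fetch(cached)
        except OSError:
            pass  # a failed download is handled by the checks below
        if cached.is_file():
            return cached

    # Assumption: if a forced refresh fails, reuse any local copy that exists.
    for candidate in (cached, packaged):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"could not resolve {filename!r}")
```

In this sketch, `force_download=True` skips steps (1) and (2) and goes straight to the download, which mirrors the behavior described above.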

Column detection. Quantile columns encode a year \(y\) and a quantile level \(q\) in their names.

\[\text{quantile name} \;\equiv\; \texttt{<prefix>}\_y\_\texttt{q}q, \qquad y \in \{2022,\dots,2026\},\; q \in (0,1) \tag{1}\]

Target columns encode only the year \(y\):

\[\text{target name} \;\equiv\; \texttt{subsidence}\_y \tag{2}\]

In code, the implementation detects these with the regular expressions r"_(\d{4})_q([0-9.]+)$" (quantile columns) and r"_(\d{4})$" (target columns).

This design enables flexible subsetting by year and quantile without hard-coding headers.
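A minimal sketch of this detection and subsetting, using the two regular expressions quoted above. The helper names `detect_columns` and `select_quantile_columns` and the example headers (such as `subsidence_2022_q0.1`) are assumptions made for illustration; the actual file headers and the loader's internals may differ.

```python
import re

# Regular expressions quoted in the note above.
QUANTILE_RE = re.compile(r"_(\d{4})_q([0-9.]+)$")
TARGET_RE = re.compile(r"_(\d{4})$")


def detect_columns(columns):
    """Split column names into quantile groups and per-year targets."""
    quantile_cols = {}  # 'q0.1' -> list of column names
    target_cols = {}    # year -> column name
    for name in columns:
        q_match = QUANTILE_RE.search(name)
        if q_match:
            key = f"q{q_match.group(2)}"
            quantile_cols.setdefault(key, []).append(name)
            continue
        t_match = TARGET_RE.search(name)
        if t_match:
            target_cols[int(t_match.group(1))] = name
    return quantile_cols, target_cols


def select_quantile_columns(columns, years=None, quantiles=None):
    """Return quantile columns matching the requested years and levels."""
    parsed = []
    for name in columns:
        m = QUANTILE_RE.search(name)
        if m:
            parsed.append((name, int(m.group(1)), float(m.group(2))))
    found_years = {year for _, year, _ in parsed}
    found_levels = {level for _, _, level in parsed}
    # Reject requests for years or levels that the headers do not contain.
    if years is not None and not set(years) <= found_years:
        raise ValueError(f"years not in data: {sorted(set(years) - found_years)}")
    if quantiles is not None and not set(quantiles) <= found_levels:
        raise ValueError(f"quantiles not in data: {sorted(set(quantiles) - found_levels)}")
    return [
        name
        for name, year, level in parsed
        if (years is None or year in years)
        and (quantiles is None or level in quantiles)
    ]


# Hypothetical headers that follow the naming scheme; the real file may differ.
columns = [
    "longitude", "latitude",
    "subsidence_2022", "subsidence_2023",
    "subsidence_2022_q0.1", "subsidence_2022_q0.5", "subsidence_2022_q0.9",
    "subsidence_2023_q0.1", "subsidence_2023_q0.5", "subsidence_2023_q0.9",
]
quantile_cols, target_cols = detect_columns(columns)
subset = select_quantile_columns(columns, years=[2023], quantiles=[0.1, 0.9])
print(sorted(quantile_cols))
print(target_cols)
print(subset)
```

Because the quantile pattern is tested first, a name ending in `_2022_q0.1` is never mistaken for a target column, and coordinate columns match neither pattern.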

Coordinate handling. When present and include_coords=True, the columns 'longitude' and 'latitude' are included and exposed both in the returned frame and as top-level arrays in the Bunch for convenience.

Intended use. The dataset is a compact sample designed for tutorials, documentation figures, and regression tests of k-diagram uncertainty diagnostics [2]. It is not a comprehensive research release.

References

Examples

Basic usage returning a Bunch with metadata:

>>> from kdiagram.datasets import load_zhongshan_subsidence
>>> ds = load_zhongshan_subsidence()
>>> import pandas as pd
>>> isinstance(ds.frame, pd.DataFrame)
True
>>> list(ds.quantile_cols.keys())[:3]
['q0.1', 'q0.5', 'q0.9']
>>> # Return only the DataFrame, subset to selected years and quantiles:
>>> df = load_zhongshan_subsidence(
...     as_frame=True, years=[2023, 2025], quantiles=[0.1, 0.9]
... )
>>> set(c.split('_')[-1] for c in df.columns if '_q' in c) <= {'q0.1','q0.9'}
True
>>> # Force a fresh download into the cache:
>>> _ = load_zhongshan_subsidence(force_download=True)