Generative deep learning improves reconstruction of global historical climate records

Zhen Qian; Teng Liu; Sebastian Bathiany; Shangshang Yang; Philipp Hess; Nils Bochow; Christian Burmester; Maximilian Gelbrecht; Brian Groenke; Niklas Boers

TL;DR

This work presents a unified, probabilistic generative deep learning framework that overcomes the over-smoothing and unphysical artifacts of conventional statistical interpolation, reveals previously unresolved historical climate variability back to 1850, and preserves the higher-order statistics of climate dynamics, turning reconstruction into a robust, uncertainty-aware assessment.

Abstract

Accurate assessment of anthropogenic climate change relies on historical instrumental data, yet observations from the early 20th century are sparse, fragmented, and uncertain. Conventional reconstructions rely on disparate statistical interpolation, which excessively smooths local features and creates unphysical artifacts, leading to systematic underestimation of intrinsic variability and extremes. Here, we present a unified, probabilistic generative deep learning framework that overcomes these limitations and reveals previously unresolved historical climate variability back to 1850. Leveraging a learned generative prior of Earth system dynamics, our model performs probabilistic inference to recover spatiotemporally consistent historical temperature and precipitation fields from sparse observations. Our approach preserves the higher-order statistics of climate dynamics, transforming reconstruction into a robust uncertainty-aware assessment. We demonstrate that our reconstruction overcomes pronounced biases in widely used historical reference products, including those underlying IPCC assessments, especially regarding extreme weather events. Notably, we uncover higher early 20th-century global warming levels compared to existing reconstructions, primarily driven by more pronounced polar warming, with mean Arctic warming trends exceeding established benchmarks by 0.15–0.29 °C per decade for 1900–1980. Conversely, for the modern era, our reconstruction indicates that the broad Arctic warming trend is likely overestimated in recent assessments, yet explicitly resolves previously unrecognized intense, localized hotspots in the Barents Sea and Northeastern Greenland. Furthermore, based on our seamless global reconstruction that recovers precipitation variability across the oceans and under-monitored regions, we uncover an intensification of the global hydrological cycle.

Paper Structure

This paper contains 12 sections, 11 equations, 30 figures, and 4 tables.

Introduction
Results
Generative reconstruction of climate dynamics
Assessing reconstruction of dynamics with a real-world observation network
Reconstructing historical temperature and precipitation fields
Informing climate assessments
Discussion
Methods
Probabilistic generative reconstruction framework
Training data and preprocessing
Benchmarking and evaluation
Reconstruction of IPCC AR6 reference datasets

Figures (30)

Figure 1: Reconstruction of global temperature anomalies (5°) from synthetic sparse input. a, Global snapshots of a representative monthly anomaly (ERA5 test set) reconstructed from 5% random observational coverage. The diffusion model (DM) panels display a single generated realization (N = 1) for both the high-fidelity (DM-Fid) and ensemble (DM-Ens) configurations, compared with deterministic baselines (LaMa, Kriging). b, Local analysis at a representative polar location (90°N, 0°). The left subpanel illustrates the temporal variability of the reconstruction against the ground truth, while the right subpanel displays the residual errors (reconstruction minus ground truth) for each method. For DM-Fid and DM-Ens, a single ensemble member is plotted to demonstrate the variability. Black dots indicate the sparse observational constraints available to the models. c, Quantitative evaluation of reconstruction accuracy as a function of data sparsity. Performance is assessed using temporal coherence (mean temporal correlation coefficient, TCC) and spatial accuracy (mean spatial normalized root mean square error, nRMSE) across varying missing-data ratios. Metrics are derived from ensemble means (N = 5 for DM-Fid; N = 50 for DM-Ens) to ensure robust comparison with deterministic baselines.
Figure 2: Reconstruction of global precipitation anomalies (0.5°) from synthetic sparse input. a, Global snapshots of a representative monthly anomaly (ERA5 test set) reconstructed from 1% random observational coverage. Maps display single-member reconstructions (DM-Fid, DM-Ens) from the generative models alongside deterministic baselines (LaMa, ADW). b, Hovmöller diagrams of tropical precipitation anomalies (averaged over 15°S–15°N). The comparison between Ground Truth, a single DM-Fid member, and ADW demonstrates our model's ability to reconstruct realistic spatiotemporal variability and extreme values that are otherwise obscured by the ADW interpolation. c, Zonal mean of reconstructed precipitation anomalies, averaged over the full time period. d, Temporally averaged spatial probability density function (PDF), computed for each time step and then averaged. e, Temporally averaged Power Spectral Density (PSD), showing the mean energy distribution across spatial scales. In panels c–e, DM-Fid (N = 5) and DM-Ens (N = 50) are compared to the benchmarks.
Figure 3: Evaluation of historical reconstruction using realistic HadCRUT5 coverage. All panels use a completely held-out CMIP6 ensemble member (CESM2 r5i1p1f1, 1850–2014) as the ground truth, with masking based on the historical HadCRUT5 observational coverage. a, b, Spatial pattern of the time-averaged missing data ratio and its corresponding global mean temporal evolution from the HadCRUT5 mask. c, Time-averaged pixel-wise reconstruction error (reconstruction minus ground truth) in unobserved regions for a single member of DM-Fid (N = 1) and the benchmark models. The global metrics denote the latitude-weighted mean absolute error (MAE). d, Bias in the lag-one autocorrelation (AC1), a metric for temporal memory, for DM-Fid (N = 1) and the benchmark models, summarized by the latitude-weighted mean absolute bias. e, f, Monthly and annual residuals for the reconstructed Global Mean Temperature (GMT), defined as reconstructed GMT minus ground-truth GMT. The DM-Ens reconstruction represents the average of the GMTs calculated from each of the 50 ensemble members. The "masked ground truth" represents the GMT calculated directly from the held-out CMIP6 member with the sparse observational masks, illustrating the observational coverage bias.
Figure 4: Reconstruction of key global temperature datasets. a, Spatial comparison of a representative month (August 1874) for HadCRUT5 and Berkeley Earth. Columns display the sparse input observations (the HadCRUT5 median and the combination of HadSST 4.2 and Berkeley Earth land stations), the official infilled product, and a single DM-Fid reconstruction (N = 1). Note the recovery of sharp regional anomalies in the DM-Fid reconstruction compared to the smoother official baselines. b, Annual Global Mean Temperature (GMT) anomalies relative to the 1850–1900 pre-industrial baseline. For HadCRUT5, the reconstruction (pink) represents the mean and 95% confidence interval derived from a nested ensemble (50 DM-Ens members generated for each of the 200 observational ensemble members). For Berkeley Earth, the uncertainty is derived from 50 DM-Ens members. c, Evolution of the spatial coverage of temperature extremes. The DM-Fid reconstruction is presented as the mean of 5 ensemble members, with shaded regions indicating the 95% confidence interval. Hot and cold extremes are defined as the fraction of the total area with valid data where anomalies exceed the 95th or fall below the 5th percentile, respectively, of the modern reference period (1991–2020).
Figure 5: Reconstruction of key precipitation datasets. a, Global spatial patterns of precipitation anomalies for a representative historical month (December 1918) based on CRU TS 4.09 and GPCC v2022. Columns display the sparse inputs aggregated from raw land stations, the official infilled products, and single realizations (N = 1) from our high-fidelity reconstruction (DM-Fid). Note that DM-Fid infers coherent oceanic precipitation structures strictly from land-based conditioning. b, Regional comparison over Africa highlighting the textural distinction: official products exhibit characteristic geometric artifacts (smooth radial clusters), whereas DM-Fid generates complex, physically plausible spatial structures. c, Annual Global Land Mean Precipitation (LMP) anomalies relative to the 1961–1990 baseline. The DM-Ens reconstruction (pink, N = 50) is shown as the ensemble mean with a 95% confidence interval. d, e, Evolution of the spatial coverage of wet extremes (fraction of land area exceeding the 95th percentile of the reference period) for CRU TS 4.09 (d) and GPCC v2022 (e). The DM-Fid reconstruction is presented as the mean of 5 ensemble members, with shaded regions indicating the 95% confidence interval. The divergence in the early 20th century is driven by station-free regions, where CRU TS 4.09 defaults to climatology (zero coverage), and GPCC v2022 suppresses variance, whereas DM-Fid maintains realistic variability consistent with ERA5 (blue line, post-1940).
...and 25 more figures
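
The captions for Figures 3 and 4 name three evaluation quantities: a latitude-weighted mean absolute error, the lag-one autocorrelation (AC1), and the area fraction of hot extremes relative to a reference-period percentile. The sketch below illustrates one plausible way to compute them with NumPy. It is not the authors' code: the function names, the (time, lat, lon) array layout, the cos(latitude) area weighting, and the choice to average absolute errors over time before weighting are all assumptions made for illustration.

```python
import numpy as np


def lat_weights(lats_deg):
    """Normalized cos(latitude) weights (assumed area weighting on a regular grid)."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.sum()


def lat_weighted_mae(recon, truth, lats_deg):
    """Latitude-weighted MAE for arrays shaped (time, lat, lon).

    Assumption: absolute errors are averaged over time and longitude first,
    then combined across latitudes with cos(latitude) weights.
    """
    err_per_lat = np.abs(recon - truth).mean(axis=(0, 2))  # shape (lat,)
    return float(np.sum(lat_weights(lats_deg) * err_per_lat))


def lag1_autocorr(x):
    """Lag-one autocorrelation (AC1) along the time axis; returns (lat, lon)."""
    a = x[:-1] - x[:-1].mean(axis=0)
    b = x[1:] - x[1:].mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a**2).sum(axis=0) * (b**2).sum(axis=0))


def extreme_area_fraction(anom, ref, lats_deg, q=95):
    """Per-time-step area fraction of valid cells exceeding the q-th percentile.

    anom: (time, lat, lon) anomalies, NaN where no valid data exist.
    ref:  (time_ref, lat, lon) anomalies from the reference period.
    """
    thresh = np.nanpercentile(ref, q, axis=0)           # (lat, lon) thresholds
    w = np.cos(np.deg2rad(lats_deg))[None, :, None]     # broadcast over time, lon
    valid = ~np.isnan(anom)
    hot = valid & (anom > thresh)
    return (w * hot).sum(axis=(1, 2)) / (w * valid).sum(axis=(1, 2))
```

A cold-extreme counterpart would follow by swapping the comparison to `anom < thresh` with `q=5`. The AC1 bias in Figure 3d would then be the difference between `lag1_autocorr` applied to a reconstruction and to the ground truth, under the same assumptions.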
