Challenges of learning multi-scale dynamics with AI weather models: Implications for stability and one solution

Ashesh Chattopadhyay; Y. Qiang Sun; Pedram Hassanzadeh

TL;DR

This paper identifies spectral bias as the universal cause of instability and unphysical drift in AI weather models when time-integrated to long horizons. It introduces FouRKS, an architecture-agnostic framework that combines Fourier-based spectral regularization, a convergent RK4 time integrator, and a self-supervised spectrum correction to enforce physical consistency during autoregressive prediction. The authors demonstrate that FouRKS yields long-term stable, physically accurate climate emulations on a two-layer quasi-geostrophic system for hundreds of thousands of days and on ERA5 data for up to a decade, with correct means and spectral structure. These results suggest a path toward reliable, data-driven climate emulation and improved sub-seasonal-to-seasonal forecasting, while acknowledging limitations and avenues for further theory and generalization to radiative forcing and full climate models.
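As a rough illustration of the RK4 component mentioned above, the sketch below wraps a tendency function in a classical fourth-order Runge-Kutta step and rolls it out autoregressively. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `rk4_step`, `rollout`, and `toy_tendency` are hypothetical, the state is a plain list of floats, and the "network" is replaced by the linear tendency dx/dt = -x so that the result can be checked against the exact solution.

```python
from typing import Callable, List

State = List[float]


def rk4_step(tendency: Callable[[State], State], x: State, dt: float) -> State:
    """Advance the state x by one classical fourth-order Runge-Kutta step."""

    def shifted(base: State, slope: State, scale: float) -> State:
        # Elementwise base + scale * slope.
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = tendency(x)
    k2 = tendency(shifted(x, k1, 0.5 * dt))
    k3 = tendency(shifted(x, k2, 0.5 * dt))
    k4 = tendency(shifted(x, k3, dt))
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


def rollout(tendency: Callable[[State], State], x0: State, dt: float, n_steps: int) -> State:
    """Autoregressive rollout: feed each predicted state back in as the next input."""
    x = list(x0)
    for _ in range(n_steps):
        x = rk4_step(tendency, x, dt)
    return x


def toy_tendency(x: State) -> State:
    # Hypothetical stand-in for a trained network: dx/dt = -x, exact solution exp(-t).
    return [-xi for xi in x]


# Integrate from t = 0 to t = 1; the exact answer is exp(-1).
x_final = rollout(toy_tendency, [1.0], dt=0.1, n_steps=10)
```

In a learned emulator, `toy_tendency` would be replaced by the trained model evaluated on the current state. The point of the wrapper is that the step inherits the convergence properties of the integrator rather than relying on the network to map one state directly to the next.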

Abstract

Long-term stability and physical consistency are critical properties for AI-based weather models if they are to be used for subseasonal-to-seasonal forecasts or beyond, e.g., climate change projection. However, current AI-based weather models can provide accurate forecasts only at short lead times, since they become unstable or physically inconsistent when time-integrated beyond a few weeks to a few months. They either exhibit numerical blow-up or hallucinate unrealistic dynamics of the atmospheric variables, akin to the current class of autoregressive large language models. The cause of the instabilities is unknown, and the methods used to extend their stability horizons are ad hoc and lack rigorous theory. In this paper, we reveal that the universal causal mechanism for these instabilities in any turbulent flow is spectral bias, wherein any deep learning architecture is biased to learn only the large-scale dynamics and ignores the small scales completely. We further elucidate how turbulence physics and the absence of convergence in deep learning-based time-integrators amplify this bias, leading to unstable error propagation. Finally, using the quasi-geostrophic flow and European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis data as test cases, we bridge the gap between deep learning theory and numerical analysis to propose one mitigative solution to such unphysical behavior. We develop long-term physically consistent data-driven models for the climate system and demonstrate accurate short-term forecasts, as well as hundreds of years of time-integration with accurate mean and variability.

Paper Structure (18 sections, 21 equations, 7 figures)

Abstract
Introduction
Results
A universal cause: Spectral bias
Mathematical demonstration of spectral bias in a single layered neural network
A solution: FouRKS (FOUrier-Runge-Kutta-with-Self-supervision)
Performance on QG
Performance analysis of each component of FouRKS
Performance on ERA5
Discussion
Data and Methods
Reanalysis data and state-of-the-art AI-based weather models
The two-layer quasi-geostrophic (QG) system
Baseline U-NET
FouRKS: FOUrier-Runge-Kutta-with-Self-supervision
...and 3 more sections

Figures (7)

Figure 1: Hallucinations in Pangu3D [bi2022pangu], GraphCast [lam2022graphcast], FourCastNet [pathak2022fourcastnet], and FourCastNetv2 [bonev2023spherical] trained on $0.25^{\circ}$ ERA5 data, shown in the Z500 and U250 fields. (a) Snapshots of FourCastNet, Pangu3D with a $1$h time step, and autoregressive GraphCast show unstable blow-up after a few days. (b) FourCastNetv2, Pangu3D with a $24$h time step, and Pangu3D with a $6$h time step remain stable but show unphysical characteristics, which are made clearer in (c). (c) To understand the unphysical characteristics in (b), we show the long-term mean U250 of all the models, which do not match the true U250 mean from ERA5 data. (d) Spectral bias: the spherical harmonic-based Fourier spectrum of predicted U250 does not match the true ERA5 spectrum even at the first time step of prediction, although the ACC shown in (c) is $\approx 1$. (e) Spectral bias grows by $10$ days of prediction.
Figure 2: Long-term instabilities in a simple U-NET-based digital twin (section \ref{sec:Unet}) trained on $2^{\circ}$ ERA5 data (section \ref{sec:ERA5}) and QG simulations (section \ref{sec:QG}). (a) The latitude-averaged instantaneous Fourier spectrum of predicted Z500 fails to capture the small-scale part of the true spectrum for $k_x \geq 25$. (b) The latitude-averaged instantaneous Fourier spectrum of $\psi_1$ predicted with U-NET shows that, even for QG simulations, the small-scale part of the true spectrum is not captured from the very first time step of prediction ($4.8$ hrs).
Figure 3: Toy 1D example of a single-layer neural network fitting a scalar-valued function $f(u)$ of a scalar $u$. As learning progresses, the total gradient of $\hat{L}(k)$ as a function of the parameters $\theta_j$ becomes smaller, since $A(k)$ becomes smaller at small wavenumbers $k$. At large wavenumbers $k$, where $A(k)$ is nonzero, the decaying spectrum of turbulence ensures that the values are small; moreover, the exponential multiplicand $\exp(-|\pi k/2w_j|)$ ensures that the total gradient of $\hat{L}(k)$ remains small and the parameters $\theta_j$ are not updated, leading to a bias in the small scales, i.e., large values of $k$.
Figure 4: Schematics for each component of the FouRKS framework and the baseline U-NET. More details about each of the components can be found in section \ref{sec:Unet} and section \ref{sec:fourks_method}.
Figure 5: Long-term statistics showing the mean, PDF, and variability of predicted dynamics of QG using FouRKS (section \ref{sec:fourks_method}) and baseline U-NET (section \ref{sec:Unet}). The mean, PDF, and EOFs have been computed over $300000$ days of prediction. (a) Zonal- and time-mean of upper-level velocity, $\left<\overline{u}_1\right>$, predicted by FouRKS and true $\left<\overline{u}_1\right>$ show excellent agreement, while U-NET's predicted $\left<\overline{u}_1\right>$ is unphysical. (b) PDF computed with predicted $\psi_1$ with FouRKS shows better agreement with the true PDF as compared to the PDF obtained from the training data. PDF obtained from baseline U-NET could not be plotted with the same axis ranges owing to unphysically large values of the predicted fields. (c) EOF1 from FouRKS shows agreement with the true EOF1. EOF1 from baseline U-NET could not be computed since the predictions from U-NET are unphysically large after $300000$ days, making the numerical computation of EOFs infeasible. (d) Similar to (c) but for EOF2.
...and 2 more figures
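
The captions of Figures 1 and 2 describe predictions whose pointwise skill looks near-perfect while the small-scale end of the spectrum is missing. The sketch below reproduces that situation on a synthetic 1D periodic signal. Everything in it is an illustrative assumption rather than the paper's diagnostic: the "truth" is a sum of sines with amplitude decaying as $1/k^2$, the "prediction" is the same signal truncated at wavenumber 10, a plain Pearson correlation stands in for ACC (which is normally computed on anomalies from climatology), and a naive discrete Fourier transform stands in for the spherical-harmonic spectrum.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """Return |DFT coefficient| / n for wavenumbers 0..n//2 of a real periodic signal."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(signal)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


def correlation(a, b):
    """Pearson correlation between two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def synthetic_field(grid, k_max):
    # Sum of sines with amplitude 1/k^2: a synthetic decaying spectrum.
    return [sum(math.sin(k * x + k) / k**2 for k in range(1, k_max + 1)) for x in grid]


n = 128
grid = [2.0 * math.pi * i / n for i in range(n)]

truth = synthetic_field(grid, k_max=40)  # contains scales up to k = 40
pred = synthetic_field(grid, k_max=10)   # "biased" prediction: nothing beyond k = 10

corr = correlation(truth, pred)
spec_true = amplitude_spectrum(truth)
spec_pred = amplitude_spectrum(pred)

print(f"correlation = {corr:.5f}")
print(f"k = 5 : true {spec_true[5]:.3e}, predicted {spec_pred[5]:.3e}")
print(f"k = 30: true {spec_true[30]:.3e}, predicted {spec_pred[30]:.3e}")
```

Because the small scales carry so little variance, the correlation stays very close to 1 even though the predicted spectrum is empty beyond the truncation wavenumber. This is why a spectral diagnostic is needed alongside pointwise metrics.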
