Comparing and Contrasting DLWP Backbones on Navier-Stokes and Atmospheric Dynamics

Matthias Karlbauer; Danielle C. Maddix; Abdul Fatir Ansari; Boran Han; Gaurav Gupta; Yuyang Wang; Andrew Stuart; Michael W. Mahoney

TL;DR

This work asks which Deep Learning Weather Prediction (DLWP) backbone best suits weather forecasting across forecast horizons, and answers it with a controlled benchmark on synthetic Navier–Stokes dynamics and WeatherBench data. It systematically compares GNN, Transformer, U-Net, and FNO backbones across parameter budgets, training protocols, and data representations (LatLon vs. HEALPix), evaluating with root mean squared error (RMSE), anomaly correlation coefficient (ACC), and long-range diagnostics. TFNO performs best on the synthetic Navier–Stokes dynamics, and ConvLSTM and SwinTransformer perform well for short- to mid-range WeatherBench forecasts. In climate-scale rollouts, models such as SFNO, FourCastNet, Pangu-Weather, and GraphCast offer stability and physical plausibility, most clearly the architectures with a spherical data representation (GraphCast and SFNO). The results underscore the importance of inductive biases and spherical representations for long-range forecasting and provide a rigorous framework to guide backbone choice and future DLWP development.
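
The RMSE and ACC scores named above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the cosine-of-latitude weighting, the uncentered form of ACC, and all function names are assumptions introduced here.

```python
import math


def latitude_weights(lats_deg):
    """Cosine-of-latitude weights, normalised to mean 1 (assumed weighting scheme)."""
    w = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_w = sum(w) / len(w)
    return [x / mean_w for x in w]


def weighted_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE over a grid indexed as [lat][lon]."""
    w = latitude_weights(lats_deg)
    n_lon = len(forecast[0])
    total = 0.0
    for i, wi in enumerate(w):
        for j in range(n_lon):
            total += wi * (forecast[i][j] - truth[i][j]) ** 2
    return math.sqrt(total / (len(w) * n_lon))


def weighted_acc(forecast, truth, climatology, lats_deg):
    """Uncentered anomaly correlation: correlates forecast and truth
    anomalies taken with respect to a climatology field."""
    w = latitude_weights(lats_deg)
    num = f_var = t_var = 0.0
    for i, wi in enumerate(w):
        for j in range(len(forecast[0])):
            fa = forecast[i][j] - climatology[i][j]
            ta = truth[i][j] - climatology[i][j]
            num += wi * fa * ta
            f_var += wi * fa * fa
            t_var += wi * ta * ta
    return num / math.sqrt(f_var * t_var)
```

Lower RMSE is better, while ACC approaches 1 when the forecast anomalies line up with the observed ones and drops toward 0 as skill is lost.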

Abstract

A large number of Deep Learning Weather Prediction (DLWP) architectures -- based on various backbones, including U-Net, Transformer, Graph Neural Network, and Fourier Neural Operator (FNO) -- have demonstrated their potential at forecasting atmospheric states. However, due to differences in training protocols, forecast horizons, and data choices, it remains unclear which (if any) of these methods and architectures are most suitable for weather forecasting and for future model development. Here, we step back and provide a detailed empirical analysis, under controlled conditions, comparing and contrasting the most prominent DLWP models, along with their backbones. We accomplish this by predicting synthetic two-dimensional incompressible Navier-Stokes and real-world global weather dynamics. On synthetic data, we observe favorable performance of FNO, while on the real-world WeatherBench dataset, our results demonstrate the suitability of ConvLSTM and SwinTransformer for short-to-mid-ranged forecasts. For long-ranged weather rollouts of up to 50 years, we observe superior stability and physical soundness in architectures that formulate a spherical data representation, i.e., GraphCast and Spherical FNO. The code is available at https://github.com/amazon-science/dlwp-benchmark.

Paper Structure (52 sections, 31 figures, 4 tables)

Introduction
Our Approach, Related Work, and Methods
Experiments and Results
Synthetic Navier-Stokes Simulation
Real-World Weather Data
Data Selection
Model Setup
Optimization
Evaluation
Short- to Mid-Ranged Forecasts
Long-Range Rollouts
Physical Soundness
Discussion
Navier-Stokes Experiments
Model, Data, and Training Specifications
...and 37 more sections

Figures (31)

Figure 1: Forecast errors on WeatherBench at five days lead-time (left; shaded areas show error spread over three different seeds), GPU memory requirements (center), and runtime (right) of prominent deep learning weather prediction models and their backbones over different parameter counts.
Figure 2: RMSE scores on $\Phi_{500}$ (geopotential at an atmospheric pressure of 500 hPa) at three different lead times (3 days left, 5 days center, 7 days right) vs. the number of parameters for DLWP models and backbones trained on a subset of variables from the WeatherBench dataset. Shaded areas depict error spread over three different model seeds.
Figure 3: RMSE on $\Phi_{500}$ for different models trained on the LatLon (solid lines) or on the HEALPix (HPX, dashed lines) mesh. When operating on the distortion-reducing HEALPix mesh, all three benchmarked methods improve their forecast performance at longer lead times.
Figure 4: Zonally averaged $Z_{500}$ (geopotential height at an atmospheric pressure of 500 hPa) forecasts of selected models initialized on Jan. 01, 2017, and run forward for 365 days. The verification panel (left) illustrates the seasonal cycle, where lower air pressures are observed in the northern hemisphere in Jan., Feb., Nov., Dec., and higher pressures in Jul., Aug., Sep. (and vice versa in the southern hemisphere). The black line traces the progression of the 540 dam (decameter) level and is added to each panel to showcase how each model's forecast captures the seasonal trend.
Figure 5: Zonally averaged $U_{10}$ winds over 365 days lead time displayed for verification (first row), ConvLSTM with 16M parameters (second row), and SFNO with 128M parameters (third row). Left and center showcase single rollouts initialized in January and June, respectively, while the right-most panel provides an average computed over all 104 forecasts, initialized from January through December 2017. While SFNO (third row) neatly reproduces the annual distribution of winds, showing the importance of a spherical representation, ConvLSTM (second row) fails to capture these dynamics at long forecast ranges.
...and 26 more figures
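
The zonally averaged panels described in Figures 4 and 5 reduce each forecast field to one value per latitude and then stack those values over lead time. A minimal sketch of that reduction is given below; the `[time][lat][lon]` array layout and the function names are assumptions made for illustration, not the plotting code behind the figures.

```python
def zonal_mean(field):
    """Average a [lat][lon] field over longitude, giving one value per latitude."""
    return [sum(row) / len(row) for row in field]


def zonal_mean_series(rollout):
    """Turn a rollout indexed as [time][lat][lon] into a [lat][time] table,
    i.e. the latitude-versus-lead-time layout used for such panels."""
    per_time = [zonal_mean(field) for field in rollout]
    # Transpose from [time][lat] to [lat][time].
    return [list(column) for column in zip(*per_time)]
```

Each row of the result is the time series of one latitude band, so a seasonal cycle shows up as a slow oscillation along the row.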
