Comparing and Contrasting DLWP Backbones on Navier-Stokes and Atmospheric Dynamics
Matthias Karlbauer, Danielle C. Maddix, Abdul Fatir Ansari, Boran Han, Gaurav Gupta, Yuyang Wang, Andrew Stuart, Michael W. Mahoney
TL;DR
This work asks which Deep Learning Weather Prediction (DLWP) backbone is best suited to weather forecasting at different horizons, and answers it with a controlled benchmark on synthetic Navier-Stokes dynamics and WeatherBench data. It systematically compares GNN, Transformer, U-Net, and FNO backbones across parameter budgets, training protocols, and data representations (LatLon vs. HEALPix), evaluating with root mean squared error (RMSE), anomaly correlation coefficient (ACC), and long-range diagnostics. Key findings: TFNO, a tensorized FNO variant, performs best on the synthetic Navier-Stokes dynamics; ConvLSTM and SwinTransformer perform well for short- to mid-range WeatherBench forecasts; and architectures that use a spherical data representation, namely Spherical FNO (SFNO) and GraphCast, give the most stable and physically plausible long rollouts. The results underscore the importance of inductive biases and spherical representations for long-range forecasting and provide a framework to guide backbone choice and future DLWP development.
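As a rough illustration of the two headline metrics named above, the following NumPy sketch computes RMSE and ACC for a single forecast field. It is a minimal sketch, not the benchmark's implementation: the function names `rmse` and `acc`, the array shapes, and the zero climatology are assumptions made for this example, and the latitude weighting commonly applied on LatLon grids is omitted for brevity.

```python
import numpy as np


def rmse(forecast: np.ndarray, target: np.ndarray) -> float:
    """Unweighted root mean squared error over all grid points."""
    return float(np.sqrt(np.mean((forecast - target) ** 2)))


def acc(forecast: np.ndarray, target: np.ndarray, climatology: np.ndarray) -> float:
    """Unweighted anomaly correlation coefficient.

    Anomalies are deviations from a reference climatology; ACC is the
    normalized inner product of forecast and target anomalies.
    """
    f_anom = forecast - climatology
    t_anom = target - climatology
    numerator = np.sum(f_anom * t_anom)
    denominator = np.sqrt(np.sum(f_anom ** 2) * np.sum(t_anom ** 2))
    return float(numerator / denominator)


if __name__ == "__main__":
    # Hypothetical 32 x 64 field with a zero climatology (illustration only).
    rng = np.random.default_rng(0)
    climatology = np.zeros((32, 64))
    truth = rng.standard_normal((32, 64))
    prediction = truth + 0.1 * rng.standard_normal((32, 64))
    print("RMSE:", rmse(prediction, truth))
    print("ACC: ", acc(prediction, truth, climatology))
```

A perfect forecast gives an RMSE of 0 and an ACC of 1; ACC drifts toward 0 as the forecast anomalies decorrelate from the observed ones, which is why it is often reported alongside RMSE at longer lead times.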
Abstract
A large number of Deep Learning Weather Prediction (DLWP) architectures, based on various backbones including U-Net, Transformer, Graph Neural Network, and Fourier Neural Operator (FNO), have demonstrated their potential for forecasting atmospheric states. However, due to differences in training protocols, forecast horizons, and data choices, it remains unclear which (if any) of these methods and architectures is most suitable for weather forecasting and for future model development. Here, we step back and provide a detailed empirical analysis, under controlled conditions, comparing and contrasting the most prominent DLWP models, along with their backbones. We accomplish this by predicting synthetic two-dimensional incompressible Navier-Stokes dynamics and real-world global weather dynamics. On synthetic data, we observe favorable performance of FNO, while on the real-world WeatherBench dataset, our results demonstrate the suitability of ConvLSTM and SwinTransformer for short- to mid-range forecasts. For long-range weather rollouts of up to 50 years, we observe superior stability and physical soundness in architectures that formulate a spherical data representation, i.e., GraphCast and Spherical FNO. The code is available at https://github.com/amazon-science/dlwp-benchmark.
