Scaling transformer neural networks for skillful and reliable medium-range weather forecasting

Tung Nguyen; Rohan Shah; Hritik Bansal; Troy Arcomano; Romit Maulik; Veerabhadra Kotamarthi; Ian Foster; Sandeep Madireddy; Aditya Grover

TL;DR

Stormer demonstrates that a standard transformer, equipped with a weather-specific embedding, randomized dynamics forecasting over multiple time intervals, and a pressure-weighted loss, can achieve state-of-the-art medium-range weather forecasts while using substantially less training data and compute. Because the model is trained on several intervals, it can produce multiple forecasts for the same target lead time at test time and combine them to improve accuracy. On WeatherBench 2 with ERA5 data at 1.40625°, Stormer is competitive at short to medium range and outperforms current methods beyond 7 days. The work also provides ablations and scaling analyses that isolate the contribution of each component and show consistent accuracy gains with larger models and more training tokens.

Abstract

Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints are available at https://github.com/tung-nd/stormer.
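The inference procedure described in the abstract, producing several forecasts for one target lead time and combining them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the interval set `(6, 12, 24)` hours, the restriction to rollouts that reuse a single interval, the plain average as the combination rule, and the names `rollout`, `aggregate_forecast`, and `toy_step` are all hypothetical. The only properties taken from the text are that the model forecasts dynamics (a change in state) over a chosen interval and that multiple forecasts for the same lead time are combined.

```python
import numpy as np


def rollout(step_fn, x0, interval, lead_time):
    """Roll a dynamics model forward with one fixed interval (hours)."""
    if lead_time % interval != 0:
        raise ValueError("lead_time must be a multiple of interval")
    x = np.asarray(x0, dtype=float)
    for _ in range(lead_time // interval):
        # Dynamics forecasting: the model predicts the change over `interval`
        # hours, so the next state is the current state plus that change.
        x = x + step_fn(x, interval)
    return x


def aggregate_forecast(step_fn, x0, lead_time, intervals=(6, 12, 24)):
    """Average the rollouts of every interval that divides lead_time.

    The interval set and the simple mean are assumptions for illustration.
    """
    usable = [dt for dt in intervals if lead_time % dt == 0]
    if not usable:
        raise ValueError("no interval divides the requested lead_time")
    forecasts = [rollout(step_fn, x0, dt, lead_time) for dt in usable]
    return np.mean(forecasts, axis=0)


def toy_step(x, interval):
    """Hypothetical stand-in for a trained model: slow linear decay."""
    return -0.01 * interval * x


if __name__ == "__main__":
    state = np.ones((2, 3))  # toy "weather state" on a 2 x 3 grid
    print(aggregate_forecast(toy_step, state, lead_time=24))
```

With a 24-hour lead time the sketch averages three rollouts (four 6-hour steps, two 12-hour steps, one 24-hour step). With an 18-hour lead time only the 6-hour rollout qualifies, so the average reduces to a single forecast.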

Paper Structure (36 sections, 4 equations, 19 figures)

Introduction
Background and Preliminaries
Methodology
Training
Pressure-weighted loss
Multi-step finetuning
Inference
Model architecture
Weather-specific embedding
Stormer Transformer block
Experiments
Comparison with State-of-the-art models
Ablation studies
Scaling analysis
Related Work
...and 21 more sections

Figures (19)

Figure 1: Illustration of an example $5$-day forecast of near-surface wind speed (color-fill) and mean sea level pressure (contours). On December 31, 2020, an extratropical cyclone impacted Alaska, setting a new North Pacific low-pressure record. Here, we evaluate the ability of Stormer to predict this record-breaking event 5 days in advance. Using initial conditions from 0000 UTC, 26 December 2020, Stormer accurately forecast both the location and strength of this extreme event.
Figure 2: Different approaches to weather forecasting. Direct and continuous methods output forecasts directly, but continuous forecasting is adaptable to various lead times by conditioning on $T$. Iterative forecasting generates forecasts at small intervals $\delta t$, which are rolled out for the final forecast. Our proposed randomized iterative forecasting combines continuous and iterative methods.
Figure 3: Preliminary results on forecasting surface temperature that led to the design choices of Stormer: (a) Different intervals are better at different lead times, (b) Weather-specific embedding is superior to standard ViT embedding, and (c) Adaptive layer norm outperforms additive embedding.
Figure 4: Global forecast results of Stormer and the baselines. We show the latitude-weighted RMSE for selected variables. Stormer is on par with or outperforms the baselines for the shown variables. During the later portion of the forecasts, Stormer gains $\sim1$ day of forecast skill with respect to climatology compared to the next best deep learning model. We note that Stormer was trained on much lower resolution data (1.40625$^\circ$) compared to Pangu-Weather (0.25$^\circ$) and GraphCast (0.25$^\circ$).
Figure 5: Ablation studies showing the importance of different components in Stormer: (a) Randomized forecasting, (b) Pressure-weighted loss, and (c) Dynamics forecasting.
...and 14 more figures
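
For reference, the latitude-weighted RMSE named in the Figure 4 caption is conventionally computed as below for a single forecast on an $H \times W$ latitude-longitude grid. This is a sketch of the usual WeatherBench-style definition, not a formula quoted from the text above; the symbols $\hat{X}$ (forecast), $X$ (ground truth), $L(i)$ (latitude weight), and $\mathrm{lat}(i)$ (latitude of grid row $i$) are notation introduced here, and how the score is averaged over many forecasts (before or after the square root) may differ between benchmark versions.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} L(i)\,\bigl(\hat{X}_{i,j} - X_{i,j}\bigr)^2},
\qquad
L(i) = \frac{\cos\bigl(\mathrm{lat}(i)\bigr)}{\frac{1}{H}\sum_{i'=1}^{H}\cos\bigl(\mathrm{lat}(i')\bigr)}
$$

The weight $L(i)$ averages to 1 over the rows and shrinks toward the poles, which compensates for the fact that cells of an equiangular grid cover less area at high latitudes.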
