Analyzing and Exploring Training Recipes for Large-Scale Transformer-Based Weather Prediction

Jared D. Willard; Peter Harrington; Shashank Subramanian; Ankur Mahesh; Travis A. O'Brien; William D. Collins

TL;DR

This study analyzes training recipes for large-scale transformer-based weather prediction by applying a minimally modified SwinV2 transformer to ERA5 data at full resolution. Through targeted ablations on model size, channel-weighted and latitude-weighted losses, and multi-step fine-tuning, it demonstrates that high deterministic skill can be achieved with off-the-shelf architectures and moderate compute budgets, outperforming IFS at several lead times. However, RMSE improvements from multi-step fine-tuning can degrade sharpness and ensemble spread, highlighting a trade-off between accuracy and probabilistic fidelity. The results, including WeatherBench 2 evaluations, indicate that carefully tuned training strategies can yield performance competitive with or superior to other data-driven models, offering practical guidance for researchers and practitioners.

Abstract

The rapid rise of deep learning (DL) in numerical weather prediction (NWP) has led to a proliferation of models that forecast atmospheric variables with skill comparable or superior to that of traditional physics-based NWP. However, these leading DL models vary widely in both their training settings and their architectures. Further, the lack of thorough ablation studies makes it hard to discern which components are most critical to success. In this work, we show that it is possible to attain high forecast skill even with relatively off-the-shelf architectures, simple training procedures, and moderate compute budgets. Specifically, we train a minimally modified SwinV2 transformer on ERA5 data and find that it attains superior forecast skill when compared against IFS. We present ablations on key aspects of the training pipeline, exploring different loss functions, model sizes and depths, and multi-step fine-tuning to investigate their effects. We also examine model performance with metrics beyond the typical ACC and RMSE, and investigate how performance scales with model size.
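To make the loss-function ablations mentioned above more concrete, here is a minimal NumPy sketch of a mean squared error that is weighted both per channel and per latitude. The exact weighting used in the paper is not stated on this page, so everything here is an assumption: the cosine-of-latitude weights follow a common convention for regular latitude-longitude grids, and the function names, array layout `(channels, lat, lon)`, and channel weights are hypothetical.

```python
import numpy as np


def latitude_weights(lats_deg):
    """Cosine-of-latitude weights, normalized to have mean 1.

    Assumption: a regular lat-lon grid, where cells shrink toward the
    poles and are down-weighted accordingly.
    """
    w = np.cos(np.deg2rad(np.asarray(lats_deg, dtype=float)))
    return w / w.mean()


def weighted_mse(pred, target, lats_deg, channel_weights):
    """Channel- and latitude-weighted MSE.

    pred, target: arrays of shape (channels, lat, lon).
    channel_weights: one hypothetical weight per channel (variable).
    """
    w_lat = latitude_weights(lats_deg)[None, :, None]                 # (1, lat, 1)
    w_ch = np.asarray(channel_weights, dtype=float)[:, None, None]   # (ch, 1, 1)
    sq_err = (np.asarray(pred) - np.asarray(target)) ** 2
    return float((w_ch * w_lat * sq_err).mean())


# Usage: two channels on a tiny 3 x 4 grid.
lats = [-60.0, 0.0, 60.0]
pred = np.ones((2, 3, 4))
target = np.zeros((2, 3, 4))
loss = weighted_mse(pred, target, lats, channel_weights=[1.0, 3.0])
```

Setting all channel weights to 1 and replacing `latitude_weights` with uniform weights recovers the plain MSE, which is one way to frame the unweighted baseline in such an ablation.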

Paper Structure (13 sections, 7 figures)

INTRODUCTION
DATASET & MODEL DETAILS
Data
Model Architecture
Ablations & Experiments
Evaluation
RESULTS
Model size, channel weighting, & multi-step fine-tuning
Downstream effects of multi-step fine-tuning
Effects of latitude-weighted loss
Additional experiments
WeatherBench 2 Evaluation
Conclusions

Figures (7)

Figure 1: WeatherBench 2 deterministic RMSE comparison of forecasts of z500, t2m, and u10m at lead times up to 10 days for the SwinV2 model (using channel weighting, 8-step fine-tuning, and latitude-weighted loss), IFS HRES, Pangu-Weather, and GraphCast, compared to climatology.
Figure 2: RMSE comparison of forecasts at lead times up to 7 days for SwinV2 models with different depths and embedding dimensions.
Figure 3: RMSE comparison of forecasts at lead times up to 7 days for the baseline SwinV2 model (depth 12, embedding dimension 768) with and without custom channel weighting.
Figure 4: RMSE comparison of forecasts at lead times up to 7 days for the model trained with custom channel weighting (depth 12, embedding dimension 768) alongside its variants with 4-step and 8-step fine-tuning.
Figure 5: Spatial frequency representation for the z500, t2m, and u10m variables across three model configurations: the baseline model using channel-weighting, and two fine-tuned versions of that model using 4-step and 8-step fine-tuning respectively. For each variable and model, the upper plot shows the power spectral density (PS1D) of both the target ERA5 (black line) and prediction (red line) on a logarithmic scale. The lower plot displays the ratio of predicted to actual PS1D values, with a ratio of 1 (dashed line) indicating perfect alignment. Ratios above or below 1 indicate overestimations or underestimations of power at specific spatial frequencies, respectively.
...and 2 more figures
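
The spectral comparison described in the Figure 5 caption can be sketched as follows. This is only an illustration under stated assumptions: the page does not define how PS1D is computed, so the sketch takes a real FFT along longitude and averages the power over latitude, and the function names and the small `eps` guard are hypothetical.

```python
import numpy as np


def zonal_power_spectrum(field):
    """Mean power per zonal wavenumber for a (lat, lon) field.

    Assumption: the 1D spectrum is an FFT along longitude, averaged
    over all latitude rows.
    """
    coeffs = np.fft.rfft(np.asarray(field, dtype=float), axis=-1)
    return (np.abs(coeffs) ** 2).mean(axis=0)


def spectrum_ratio(pred, target, eps=1e-12):
    """Ratio of predicted to target power; 1 means the spectra agree."""
    return zonal_power_spectrum(pred) / (zonal_power_spectrum(target) + eps)


# Usage: a synthetic "prediction" that keeps the large-scale wave
# (wavenumber 3) but halves the amplitude of the small-scale wave
# (wavenumber 10), mimicking a blurred forecast.
lon = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
target = np.tile(np.sin(3 * lon) + np.sin(10 * lon), (8, 1))
pred = np.tile(np.sin(3 * lon) + 0.5 * np.sin(10 * lon), (8, 1))
ratio = spectrum_ratio(pred, target)
```

A ratio below 1 at high wavenumbers, as at wavenumber 10 in this toy case, corresponds to the loss of fine-scale power (reduced sharpness) that the TL;DR associates with multi-step fine-tuning.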
