Scaling Laws of Global Weather Models

Yuejiang Yu, Langwen Huang, Alexandru Calotoiu, Torsten Hoefler

Abstract

Data-driven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size ($N$), dataset size ($D$), and compute budget ($C$). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10x reduces validation loss by up to 3.2x. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to longer training durations yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.

Paper Structure (20 sections, 27 equations, 7 figures, 7 tables)

This paper contains 20 sections, 27 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Scaling behavior of global weather models. We report the validation loss during training to evaluate model performance. We also report the training data required to reach a 25 Pflop compute budget.
  • Figure 2: Data-scaling laws across weather forecasting models: $\mathcal{L}(D) = \alpha D^{-\beta}$. Aurora (red) achieves the lowest $\mathcal{L}$ at $D=100$ TB and also has the largest $\beta$, indicating the most efficient scaling with additional data.
  • Figure 3: Parameter-scaling laws for different weather prediction models trained on 15.0 TB and 30.0 TB datasets. Each marker corresponds to a model variant, with dashed lines denoting power-law fits of the form $\mathcal{L}(N) = \gamma N^{-\delta}$. Marker transparency indicates dataset size, with darker markers representing larger $D$. At fixed $D$, validation loss improves with larger $N$, and increasing $D$ shifts the scaling curves downward while preserving or increasing the scaling exponent $\delta$ within each model.
  • Figure 4: Wider models perform better. For each model, we compare two configurations, one wider and one narrower, with roughly the same $N$ but different shapes. In all models, the wider configuration achieves lower validation loss. This implies that weather forecasting benefits more from representational capacity (width) than from additional nonlinear transformations (depth).
  • Figure 5: Compute-Optimal Training. The panels illustrate $\mathcal{L}$ as a function of $D$, with each curve representing a fixed $C$. The resulting parabolas identify the compute-optimal frontier, i.e., the ratio of $N$ to $D$ that minimizes loss for a given $C$. For each model, we have $C \sim ND$, which implies $N \sim C/D$ along each curve. For Pangu, the loss is predominantly determined by $D$. Aurora and SFNO exhibit a clear minimum in the curve, indicating a trade-off between larger $N$ and larger $D$. GraphCast and AIFS show mainly the left half of the parabola, indicating that performance is bottlenecked by insufficient $D$.
  • ...and 2 more figures
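
The power-law forms in the captions of Figures 2 and 3, $\mathcal{L}(D) = \alpha D^{-\beta}$ and $\mathcal{L}(N) = \gamma N^{-\delta}$, become straight lines in log-log space, so one plausible way to estimate their exponents is ordinary least squares on the logarithms. The sketch below illustrates this on synthetic data only; the function names (`fit_power_law`, `implied_exponent`), the sample sizes, and the "true" coefficients are assumptions made for illustration and are not taken from the paper, whose actual fitting procedure is not described in this section.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y).

    Returns (a, b). Hypothetical helper, not the paper's procedure.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                      # slope of the log-log line is -b
    intercept = mean_y - slope * mean_x    # intercept is log(a)
    return math.exp(intercept), -slope


def implied_exponent(scale_factor, loss_reduction):
    """Exponent b such that scaling x by `scale_factor` divides y by `loss_reduction`."""
    return math.log(loss_reduction) / math.log(scale_factor)


# Synthetic, noise-free data (assumed values, NOT measurements from the paper).
alpha_true, beta_true = 2.0, 0.5
sizes_tb = [1.0, 3.0, 10.0, 30.0, 100.0]
losses = [alpha_true * d ** (-beta_true) for d in sizes_tb]

alpha_hat, beta_hat = fit_power_law(sizes_tb, losses)
print(f"alpha ~ {alpha_hat:.3f}, beta ~ {beta_hat:.3f}")

# Exponent consistent with a 10x data increase giving a 3.2x loss reduction.
beta_from_abstract = implied_exponent(10.0, 3.2)
print(f"implied beta ~ {beta_from_abstract:.3f}")
```

Under a pure power law, the abstract's upper-bound figure for Aurora (10x more data, up to 3.2x lower loss) corresponds to $\beta = \log_{10} 3.2 \approx 0.5$. Real validation curves are noisy and may include an irreducible loss term, in which case a nonlinear fit would be more appropriate than this log-log regression.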