On Neural Scaling Laws for Weather Emulation through Continual Training

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov, Amir Gholami, Dmitriy Morozov, Michael W. Mahoney

Abstract

Neural scaling laws, which in some domains can predict the performance of large neural networks as a function of model, data, and compute scale, are the cornerstone of building foundation models in Natural Language Processing and Computer Vision. We study neural scaling in Scientific Machine Learning, focusing on models for weather forecasting. To analyze scaling behavior in as simple a setting as possible, we adopt a minimal, scalable, general-purpose Swin Transformer architecture, and we use continual training with constant learning rates and periodic cooldowns as an efficient training strategy. We show that models trained in this minimalist way follow predictable scaling trends and even outperform standard cosine learning rate schedules. Cooldown phases can be re-purposed to improve downstream performance, e.g., enabling accurate multi-step rollouts over longer forecast horizons as well as sharper predictions through spectral loss adjustments. We also systematically explore a wide range of model and dataset sizes under various compute budgets to construct IsoFLOP curves, and we identify compute-optimal training regimes. Extrapolating these trends to larger scales highlights potential performance limits, demonstrating that neural scaling can serve as an important diagnostic for efficient resource allocation. We open-source our code for reproducibility.
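
The training recipe described above can be made concrete with a short sketch. The code below is a minimal illustration rather than the authors' released implementation: it builds a constant-learning-rate schedule with a linear warmup and a rapid cooldown to zero over the final 5% of iterations (the linear decay shape, the warmup length, and the base learning rate are illustrative assumptions):

import torch

def constant_with_cooldown(total_iters, warmup_iters, cooldown_frac=0.05):
    """Multiplier on the base LR: linear warmup, long constant phase, cooldown to 0."""
    cooldown_start = int(total_iters * (1.0 - cooldown_frac))

    def lr_lambda(it):
        if it < warmup_iters:                        # linear warmup
            return it / max(1, warmup_iters)
        if it < cooldown_start:                      # long constant phase
            return 1.0
        # rapid decay to zero over the final iterations (linear shape assumed)
        return max(0.0, (total_iters - it) / (total_iters - cooldown_start))

    return lr_lambda

model = torch.nn.Linear(8, 8)                        # stand-in for the Swin emulator
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=constant_with_cooldown(total_iters=24_000, warmup_iters=1_000)
)
# training loop: loss.backward(); opt.step(); sched.step(); opt.zero_grad()

Because the constant phase never decays, a single continually trained checkpoint can be branched into cooldowns of different lengths, or cooldowns that swap in a different objective, which is what makes the re-purposing described in the abstract inexpensive.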

Paper Structure

This paper contains 13 sections, 8 equations, 13 figures, 1 table, and 1 algorithm.

Figures (13)

  • Figure 1: Neural scaling for weather emulation. We pre-train several models using continual training (constant learning rates with periodic cooldowns; see §\ref{sec:methods}), and we identify compute-optimal regimes to train the neural emulator so that neither data nor model size saturates at a given compute budget. At each FLOP budget, several model sizes (up to 400M parameters) are trained on different dataset sizes to form IsoFLOPs that demonstrate the tradeoff between model and data size. Unlike NLP models, these systems are trained for multiple epochs (indicated by vertical dotted lines), so samples are revisited after the first epoch and effectively treated as pseudo-samples. We fit parabolas to each IsoFLOP, and we track the compute-optimal model at each budget (a minimal fitting sketch follows this figure list).
  • Figure 2: Loss behavior for cosine vs constant LR with cooldown. (left) LR schedules: The cosine schedule follows a half-cosine decay after a fixed warmup, while the constant$+$cooldown schedule holds the LR constant after the same fixed warmup and then decays rapidly to 0 at the end; the cooldown occupies the last 5% of iterations. (right) Loss vs iterations for different Swin models: The validation loss of the model continually trained with a constant LR and cooled down at different iteration counts (here the last 5% of iterations is used as the cooldown period) is lower than that of Swin models trained from scratch with cosine schedules matched to the same total iteration counts.
  • Figure 3: Loss as a function of total iterations used for cooldown. We show the MSE over 36 hours (6 autoregressive steps) of prediction, averaged over the validation data (2017). The MSE decreases predictably with longer cooldown durations, and this holds across multi-step predictions. At around 5%, the gains start to diminish. The behavior also holds when the cooldown is repurposed with a 4-step AR loss, which yields lower errors across the longer horizon (a rollout-loss sketch follows this figure list).
  • Figure 4: Cooldowns can be used for alignment. When evaluated on the 2020 test year, the Swin model cooled down at 24,000 iterations surpasses the NWP baseline (HRES) and is comparable to the state-of-the-art deterministic deep learning benchmark, GraphCast. When the AR loss is used in the cooldown (Swin-AR4), the RMSE drops further, consistent with the use of this loss. When the AMSE loss is used (Swin-AMSE), the PSD retains high wavenumbers. This is most easily seen in $q700$, where the AMSE spectrum matches ERA5 perfectly while the other models blur significantly. The AR loss contributes to more blurring in favor of reduced RMSE (visible as dissipation of power at high wavenumbers). We note that HRES models weather at $0.1^\circ$ resolution and hence resolves finer scales.
  • Figure 5: Optimal model sizes as a function of compute. (left) Similar to Fig. \ref{fig:scaling-data}, we show the validation loss as a function of model size for different compute budgets. For each budget, the different model sizes are trained for a different number of iterations to create an IsoFLOP. We track the minimum of each IsoFLOP (via a fitted parabola). (middle) We fit an empirical scaling law to find the optimal model size for any FLOP budget and project to $2.25\times10^{21}$ FLOPs to find the optimal model. (right) We also project the loss to this final FLOP value; the loss at this value saturates at 0.005.
  • ...and 8 more figures
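
The IsoFLOP construction in Figures 1 and 5 can be summarized with a short fitting sketch: for each compute budget, fit a parabola to the validation loss as a function of log model size, read off its minimum as the compute-optimal size, and then fit a power law across budgets to extrapolate to larger budgets. The code below follows that recipe; the measurements are placeholders, not values from the paper.

import numpy as np

def isoflop_minimum(model_sizes, losses):
    """Fit loss = a*(log10 N)^2 + b*log10 N + c and return the minimizing N."""
    logN = np.log10(model_sizes)
    a, b, c = np.polyfit(logN, losses, deg=2)
    logN_opt = -b / (2.0 * a)
    return 10.0 ** logN_opt

def fit_compute_optimal(budgets_flops, optimal_sizes):
    """Fit N_opt = k * C^alpha in log-log space."""
    alpha, logk = np.polyfit(np.log10(budgets_flops), np.log10(optimal_sizes), deg=1)
    return alpha, 10.0 ** logk

# Hypothetical measurements: {FLOP budget: (model sizes, validation losses)}.
isoflops = {
    1e19: ([2e7, 5e7, 1e8, 2e8], [0.031, 0.027, 0.028, 0.033]),
    1e20: ([5e7, 1e8, 2e8, 4e8], [0.022, 0.019, 0.018, 0.021]),
    1e21: ([1e8, 2e8, 4e8, 8e8], [0.015, 0.013, 0.012, 0.014]),
}
budgets = sorted(isoflops)
n_opts = [isoflop_minimum(*isoflops[C]) for C in budgets]

alpha, k = fit_compute_optimal(budgets, n_opts)
print(f"N_opt(C) ~ {k:.3g} * C^{alpha:.2f}")
print(f"Projected optimal size at 2.25e21 FLOPs: {k * (2.25e21) ** alpha:.3g} parameters")

A similar log-log fit on the per-budget minimum losses can be used to project the loss to a target FLOP budget, as in Figure 5 (right).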
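
Figures 3 and 4 also mention re-purposing the cooldown phase with a 4-step autoregressive (AR) loss. A minimal sketch of such a rollout objective is given below, assuming a generic single-step emulator and 6-hour steps; the model, the tensor shapes, and the equal weighting of steps are illustrative assumptions rather than the paper's exact configuration.

import torch

def ar_rollout_loss(model, x0, targets, n_steps=4):
    """Average MSE over an n-step autoregressive rollout.

    x0:      initial atmospheric state, shape (batch, channels, lat, lon)
    targets: list of the next n_steps ground-truth states, each shaped like x0
    """
    loss = 0.0
    state = x0
    for step in range(n_steps):
        state = model(state)                     # predict the next 6-hour state
        loss = loss + torch.nn.functional.mse_loss(state, targets[step])
    return loss / n_steps

# During the cooldown, the single-step MSE can simply be replaced by this
# rollout loss while the learning rate decays:
#   loss = ar_rollout_loss(swin, batch_x, batch_targets, n_steps=4)
#   loss.backward(); opt.step(); sched.step(); opt.zero_grad()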