Inductive biases in deep learning models for weather prediction

Jannik Thuemmel, Matthias Karlbauer, Sebastian Otte, Christiane Zarfl, Georg Martius, Nicole Ludwig, Thomas Scholten, Ulrich Friedrich, Volker Wulfmeyer, Bedartha Goswami, Martin V. Butz

TL;DR

The paper analyzes why state-of-the-art deep learning weather prediction (DLWP) systems succeed by decomposing inductive biases into data selection, learning objectives, loss functions, architectures, and training. Using four representative models (CubedSphereNet, FourCastNet, GraphCast, PanGu-Weather) as focal points, it shows that data availability and compute dominate, while the inductive biases in design choices shape efficiency and generalisation. It highlights deterministic and probabilistic losses, encoder–processor–decoder architectures, and curriculum-based training as key design levers, and discusses promising directions in probabilistic modeling and physics-informed learning. The work provides a practical framework for improving DLWP performance, reliability, and scalability, with implications for near-term deployment and future research in weather and climate forecasting.

Abstract

Deep learning has gained immense popularity in the Earth sciences as it enables us to formulate purely data-driven models of complex Earth system processes. Deep learning-based weather prediction (DLWP) models have made significant progress in the last few years, achieving forecast skill comparable to established numerical weather prediction models at comparatively lower computational cost. To train accurate, reliable, and tractable DLWP models with several millions of parameters, the model design needs to incorporate suitable inductive biases that encode structural assumptions about the data and the modelled processes. When chosen appropriately, these biases enable faster learning and better generalisation to unseen data. Although inductive biases play a crucial role in successful DLWP models, they are often not stated explicitly and their contribution to model performance remains unclear. Here, we review and analyse the inductive biases of state-of-the-art DLWP models with respect to five key design elements: data selection, learning objective, loss function, architecture, and optimisation method. We identify the most important inductive biases and highlight potential avenues towards more efficient and probabilistic DLWP models.

Paper Structure (22 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Objective definitions in the four reviewed models. All models shown here support autoregressive roll-outs of, in principle, arbitrary length. GraphCast learns a map from two observed states, 6 hours apart, to a subsequent state 6 hours in the future. PanGu-Weather maps a single observed state to a desired lead-time state with intermediate step-sizes being combined from 1, 3, 6 or 24 hours according to a greedy algorithm that minimises the number of autoregressive steps. FourCastNet maps a single state 6 hours into the future. CubedSphereNet maps two observed states onto the two subsequent states, with each state being 6 hours apart.
  • Figure 2: Loss functions used in the four models under consideration. All models utilise an L-p norm between the observed and predicted atmospheric state, evaluated at each spatio-temporal location individually and then averaged across all locations. GraphCast additionally weights the loss with pre-defined weights per variable and per latitude. PanGu-Weather utilises per-variable weights derived from the performance of an earlier training run. FourCastNet weights the loss per latitude, while CubedSphereNet does not report any further weighting of the loss.
  • Figure 3: Architecture overview of the four reviewed models. We emphasise encode-process-decode structures as well as the use of (multiple) latent spatial scales and/or representation formats. Computational block designs are illustrated in the lower part of the figure, although we note that we omitted the ubiquitous use of LayerNorm in all of these blocks. For detailed explanations and illustrations of the architectures we refer to the respective figures in their original papers.
  • Figure 4: Training schemes of the four considered models. GraphCast utilises a learning rate schedule consisting of a short linear warm-up, followed by a long cosine annealing phase and a short fine-tuning phase at the end. The number of autoregressive steps is initially kept at one but increased up to twelve in the final stage of training. PanGu-Weather is trained only on one-step-ahead predictions with a long cosine annealing schedule; we assume that the number of training iterations differs for each sub-model, but note that this information has not been provided by the authors. FourCastNet uses a two-stage training procedure, first on one-step and then on two-step forecasts, with a cosine annealing schedule in each stage. CubedSphereNet does not use a pre-defined schedule; instead, the learning rate is decreased by a factor of five once the validation criterion has not decreased for a given number of steps. We illustrate a possible trajectory here.
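
The greedy lead-time decomposition described for PanGu-Weather in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes integer lead times in hours, and the names `STEP_SIZES_HOURS` and `greedy_steps` are hypothetical. Only the step sizes (1, 3, 6 and 24 hours) and the goal of minimising the number of autoregressive steps come from the caption.

```python
from typing import List, Sequence

# Step sizes (in hours) of the sub-models named in the Figure 1 caption.
STEP_SIZES_HOURS = (24, 6, 3, 1)


def greedy_steps(lead_time_hours: int,
                 step_sizes: Sequence[int] = STEP_SIZES_HOURS) -> List[int]:
    """Split a lead time into autoregressive steps, largest step first.

    Hypothetical helper: always takes the largest step that still fits
    into the remaining lead time, then moves on to the next smaller one.
    """
    if lead_time_hours < 0:
        raise ValueError("lead time must be non-negative")

    steps: List[int] = []
    remaining = lead_time_hours
    for size in sorted(step_sizes, reverse=True):
        # How many steps of this size fit, and what is left afterwards.
        count, remaining = divmod(remaining, size)
        steps.extend([size] * count)

    if remaining:
        # Only reachable with custom step sizes that lack a 1-hour step.
        raise ValueError("lead time cannot be reached with the given step sizes")
    return steps


print(greedy_steps(56))  # [24, 24, 6, 1, 1]
```

Because each of these step sizes divides the next larger one, taking the largest step first also yields the smallest possible number of steps; for other step sets that property would need to be checked separately.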