CaFA: Global Weather Forecasting with Factorized Attention on Sphere

Zijie Li; Anthony Zhou; Saurabh Patil; Amir Barati Farimani

TL;DR

CaFA tackles the computational bottleneck of global Transformer-based weather forecasting by introducing factorized axial attention on the sphere, preserving spherical geometry while reducing cost. The method decomposes the multi-dimensional attention kernel into axis-specific operators and augments it with distance-encoded attention and spherical harmonic positional encoding. It uses an Encoder-Processor-Decoder architecture with height compression/recovery and patch-based spatial downsampling to operate on a latent grid, achieving competitive deterministic forecasts at $1.5^\circ$ resolution for lead times of $0$–$7$ days with roughly $2\times 10^8$ parameters. Results show CaFA outperforms IFS HRES on several variables within $5$ days and offers favorable accuracy-efficiency trade-offs versus leading MLWP models, while highlighting avenues for higher-resolution scaling and probabilistic forecasting.
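To make the idea of axis-specific attention concrete, here is a minimal NumPy sketch. It uses plain sequential softmax attention along latitude and then longitude as a stand-in; it is not the paper's factorized kernel, and it omits the distance encoding, spherical harmonic positional encoding, and multi-head structure. All function names, the residual connections, and the toy grid size are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_along_axis(x, wq, wk, wv, axis):
    """Single-head attention applied independently along one grid axis.

    x has shape (H, W, C). axis=0 attends over latitude (H),
    axis=1 attends over longitude (W). Hypothetical helper.
    """
    # Put the attended axis in position 1; the other axis acts as a batch.
    xs = x if axis == 1 else x.transpose(1, 0, 2)        # (batch, length, C)
    q, k, v = xs @ wq, xs @ wk, xs @ wv
    # Score matrix is only (length x length) per batch entry,
    # never (H*W x H*W).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v
    return out if axis == 1 else out.transpose(1, 0, 2)

def axial_attention(x, params_lat, params_lon):
    # Residual update along latitude, then along longitude (an assumed ordering).
    x = x + attend_along_axis(x, *params_lat, axis=0)
    x = x + attend_along_axis(x, *params_lon, axis=1)
    return x

rng = np.random.default_rng(0)
H, W, C = 6, 12, 8                                        # toy latent grid
x = rng.standard_normal((H, W, C))
make = lambda: tuple(rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
y = axial_attention(x, make(), make())
print(y.shape)  # (6, 12, 8)
```

The point of the sketch is the shape of the score matrices: each one spans a single axis, so no attention matrix over all grid points is ever formed.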

Abstract

Accurate weather forecasting is crucial in various sectors, impacting decision-making processes and societal events. Data-driven approaches based on machine learning models have recently emerged as a promising alternative to numerical weather prediction models, given their potential to capture physics at different scales from historical data and their significantly lower computational cost at prediction time. Renowned for its state-of-the-art performance across diverse domains, the Transformer model has also gained popularity in machine learning weather prediction. Yet applying Transformer architectures to weather forecasting, particularly on a global scale, is computationally challenging because of the quadratic complexity of attention and the quadratic growth in the number of spatial points as resolution increases. In this work, we propose a factorized-attention-based model tailored for spherical geometries to mitigate this issue. More specifically, it uses multi-dimensional factorized kernels that convolve over different axes, so that the computational complexity of the kernel is quadratic only in the axial resolution rather than in the overall resolution. The deterministic forecasting accuracy of the proposed model at $1.5^\circ$ resolution and 0-7 days' lead time is on par with state-of-the-art purely data-driven machine learning weather prediction models. We also show that the proposed model holds great potential to push forward the Pareto front of accuracy-efficiency for Transformer weather models, achieving better accuracy at lower computational cost than Transformer-based models with standard attention.
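As a rough illustration of the complexity claim, the following sketch counts query-key pairs for full attention versus per-axis attention. It assumes an equiangular $1.5^\circ$ grid that includes both poles (121 latitudes by 240 longitudes) and applies the count to that raw grid; the model itself operates on a downsampled latent grid, so its actual counts differ. The function names are hypothetical.

```python
def full_attention_pairs(n_lat, n_lon):
    # Every grid point attends to every other grid point.
    n = n_lat * n_lon
    return n * n

def axial_attention_pairs(n_lat, n_lon):
    # Latitude pass: n_lon columns, each with n_lat**2 pairs.
    # Longitude pass: n_lat rows, each with n_lon**2 pairs.
    return n_lon * n_lat**2 + n_lat * n_lon**2

# Assumed raw grid: 180/1.5 + 1 = 121 latitudes, 360/1.5 = 240 longitudes.
n_lat, n_lon = 121, 240
full = full_attention_pairs(n_lat, n_lon)
axial = axial_attention_pairs(n_lat, n_lon)
print(full)                    # 843321600
print(axial)                   # 10483440
print(round(full / axial, 1))  # 80.4
```

Under these assumptions the per-axis count is about 80 times smaller, and the gap widens as resolution grows, since the ratio equals $HW/(H+W)$ for an $H \times W$ grid.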

Paper Structure (25 sections, 15 equations, 16 figures, 8 tables)

Introduction
Related Works
Methodology
Attention mechanism
Axial factorized attention on sphere
Factorized attention
Distance encoding and positional encoding
Grid projection
Height compression/recovery
Spatial downsample/upsample
Model overview
Architecture
Training
Experiments and Results
Evaluation setting
...and 10 more sections

Figures (16)

Figure 2.1: Main schematic of the proposed Transformer-based weather forecast model - CaFA. The model approximates a Markovian mapping that forwards the last system state to the next system state with a fixed time interval $\Delta t$.
Figure 4.1: Normalized RMSE difference: $(\text{RMSE}_{\text{HRES}} - \text{RMSE}_{\text{CaFA}}) / \text{RMSE}_{\text{HRES}}$. Blue colors indicate IFS HRES has larger RMSE while red colors indicate CaFA has larger RMSE. Darker colors indicate a larger normalized difference. The plotting style follows GraphCast [graphcast2023science].
Figure 4.2: Comparison of CaFA and NWP on temperature and pressure prediction for the year 2020.
Figure 4.3: Comparison of CaFA and NWP on specific humidity and wind velocity prediction for the year 2020.
Figure 4.4: Example rollout visualizations of the model's prediction versus reference ERA5 reanalysis data at different lead times. The initialization time is 00:00 UTC on August 11, 2020.
...and 11 more figures
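
The normalized RMSE difference in the Figure 4.1 caption can be computed as in the sketch below. The function name and the input values are hypothetical placeholders chosen for illustration, not results from the paper.

```python
import numpy as np

def normalized_rmse_difference(rmse_hres, rmse_cafa):
    # (RMSE_HRES - RMSE_CaFA) / RMSE_HRES, elementwise.
    # Positive: HRES has the larger RMSE (blue in the figure).
    # Negative: CaFA has the larger RMSE (red in the figure).
    rmse_hres = np.asarray(rmse_hres, dtype=float)
    rmse_cafa = np.asarray(rmse_cafa, dtype=float)
    return (rmse_hres - rmse_cafa) / rmse_hres

# Made-up RMSE values at two lead times, for illustration only.
diff = normalized_rmse_difference([1.0, 2.0], [0.9, 2.2])
print(np.sign(diff))
```

Dividing by the HRES error makes the values comparable across variables with different units and magnitudes.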
