STC-ViT: Spatio Temporal Continuous Vision Transformer for Medium-range Global Weather Forecasting

Hira Saleem, Flora Salim, Cormac Purcell

TL;DR

STC-ViT introduces a Spatio-Temporal Continuous Vision Transformer that models weather dynamics as a continuous-time process by integrating a Fourier spectral operator with a Neural ODE-based transformer encoder. By fusing patch-level ViT tokens with global spectral context and spherical-harmonic positional encoding, the approach achieves competitive medium-range forecasts with a shallow transformer and fast inference (a reported 0.125 seconds per complete forecast trajectory). The work analyses the effect of ODE solver choice, transformer depth, and time-grid resolution, and reports strong performance on the WeatherBench and WeatherBench 2 datasets at reduced data and compute cost. It highlights the potential of continuous-depth, topology-aware architectures for efficient, accurate weather forecasting and points to future directions in probabilistic modeling and multi-dataset generalization.
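To make the "global spectral context" idea concrete, here is a minimal NumPy sketch of an FNO-style spectral convolution: transform the field to Fourier space, keep a block of low-frequency modes, mix channels per mode with complex weights, and transform back. This is a generic illustration, not the STC-ViT implementation. The function name `spectral_conv2d`, the weight layout, and the single-corner mode truncation are assumptions made for brevity (a full FNO layer also retains the negative-frequency block along the first axis).

```python
import numpy as np

def spectral_conv2d(x, weights, modes_lat, modes_lon):
    """FNO-style spectral convolution on a gridded field (illustrative only).

    x:       real array of shape (C_in, H, W)
    weights: complex array of shape (C_in, C_out, modes_lat, modes_lon)
    Only the lowest modes_lat x modes_lon Fourier modes are retained.
    """
    c_in, h, w = x.shape
    c_out = weights.shape[1]
    # Real 2-D FFT over the two spatial axes: shape (C_in, H, W // 2 + 1).
    x_hat = np.fft.rfft2(x, axes=(-2, -1))
    out_hat = np.zeros((c_out, h, w // 2 + 1), dtype=complex)
    # Mix input channels into output channels independently for each kept mode.
    out_hat[:, :modes_lat, :modes_lon] = np.einsum(
        "ixy,ioxy->oxy", x_hat[:, :modes_lat, :modes_lon], weights
    )
    # Back to physical space; every output point depends on the whole input grid.
    return np.fft.irfft2(out_hat, s=(h, w), axes=(-2, -1))

rng = np.random.default_rng(0)
field = rng.standard_normal((3, 32, 64))  # hypothetical 3-variable lat/lon grid
w = rng.standard_normal((3, 8, 4, 4)) + 1j * rng.standard_normal((3, 8, 4, 4))
context = spectral_conv2d(field, w, modes_lat=4, modes_lon=4)
print(context.shape)  # (8, 32, 64)
```

Because the weights act in Fourier space, each output value depends on the entire input grid, which is what lets a spectral layer supply global context alongside patch-level tokens.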

Abstract

Operational Numerical Weather Prediction (NWP) systems rely on computationally expensive physics-based models. Recently, transformer models have shown remarkable potential in weather forecasting, achieving state-of-the-art results. However, traditional transformers discretize the spatio-temporal dimensions, limiting their ability to model continuous dynamical weather processes. Moreover, their reliance on increased depth to capture complex dependencies results in higher computational cost and parameter redundancy. We address these issues with STC-ViT, a Spatio-Temporal Continuous Vision Transformer for weather forecasting. STC-ViT integrates a Fourier Neural Operator (FNO) for global spatial operators with a transformer-parameterised Neural ODE for continuous-time dynamics, yielding a space-time continuous model for weather forecasting. Our proposed method achieves competitive forecasting performance even with a shallow, single-layer transformer encoder, mitigating the reliance on deeper networks. STC-ViT generates complete forecast trajectories with an inference time of only 0.125 seconds and achieves strong medium-range forecasting skill on 1.5-degree WeatherBench 2 compared with state-of-the-art data-driven and NWP models trained on higher-resolution data, with substantially lower data and compute costs. We also provide a detailed empirical analysis of the model's performance with respect to denser time grids, higher-accuracy ODE solvers, and deeper transformer stacks.
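The phrase "transformer-parameterised Neural ODE" can be illustrated with a small sketch: a single attention block defines the time derivative of the token states, and a numerical solver integrates that derivative over a time grid to produce a whole trajectory. The NumPy code below is a hedged illustration under several assumptions, not the authors' model: the dynamics are autonomous, the block is single-head with untrained random weights, the solver is fixed-step RK4, and all names (`attention_dynamics`, `integrate`, `init_params`) are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_dynamics(z, params):
    """dz/dt = f(z): single-head self-attention followed by a small MLP.

    z has shape (num_tokens, dim); the result has the same shape.
    """
    wq, wk, wv, w1, w2 = params
    q, k, v = z @ wq, z @ wk, z @ wv
    attn = softmax(q @ k.T / np.sqrt(z.shape[-1]))
    return np.tanh((attn @ v) @ w1) @ w2

def rk4_step(f, z, dt, params):
    """One classical fourth-order Runge-Kutta step for autonomous dynamics."""
    k1 = f(z, params)
    k2 = f(z + 0.5 * dt * k1, params)
    k3 = f(z + 0.5 * dt * k2, params)
    k4 = f(z + dt * k3, params)
    return z + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(z0, t_grid, params, f=attention_dynamics):
    """Return the token states at every time in t_grid, shape (T, N, D)."""
    traj = [z0]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        traj.append(rk4_step(f, traj[-1], t1 - t0, params))
    return np.stack(traj)

def init_params(dim, hidden, seed=0):
    """Random (untrained) weights, scaled to keep activations moderate."""
    rng = np.random.default_rng(seed)
    s = 1.0 / np.sqrt(dim)
    return (
        rng.normal(0.0, s, (dim, dim)),      # query projection
        rng.normal(0.0, s, (dim, dim)),      # key projection
        rng.normal(0.0, s, (dim, dim)),      # value projection
        rng.normal(0.0, s, (dim, hidden)),   # MLP in
        rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, dim)),  # MLP out
    )

tokens = np.random.default_rng(1).standard_normal((16, 8))  # 16 patch tokens, dim 8
params = init_params(dim=8, hidden=32)
t_grid = np.linspace(0.0, 1.0, 5)  # 5 output times, e.g. successive lead times
trajectory = integrate(tokens, t_grid, params)
print(trajectory.shape)  # (5, 16, 8)
```

In this formulation the same weights are reused at every solver step, so adding output times or refining the grid changes the amount of computation but not the parameter count, unlike stacking additional discrete transformer layers.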

Paper Structure (33 sections, 22 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Complete architectural pipeline of STC-ViT
  • Figure 2: RMSE and ACC comparison of STC-ViT (trained on $1.5^\circ$ data) against GraphCast and Pangu-Weather ($0.25^\circ$) and IFS-HRES ($0.1^\circ$) for lead times from 1 to 10 days
  • Figure 3: Scaling analysis showing that STC-ViT consistently improves with denser time grids, higher-order adaptive solvers, and increased transformer depth.
  • Figure 4: Ablation studies showing how adding each component to the network improves the performance of STC-ViT. The vanilla ViT with depth performs worst, highlighting the major drawback of transformer architectures
  • Figure 5: 6-hour forecast results of STC-ViT
  • ...and 3 more figures
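
The scaling analysis in Figure 3 refers to denser time grids and higher-order solvers. The general numerical effect behind both can be seen on a toy problem with a known solution. The sketch below is an assumption-laden illustration that does not reproduce any result from the paper: it uses the scalar ODE dz/dt = -z and fixed-step Euler and RK4 solvers (not the adaptive solvers the caption mentions) to compare integration error as the number of steps grows.

```python
import math

def euler_step(f, t, z, dt):
    """First-order explicit Euler step."""
    return z + dt * f(t, z)

def rk4_step(f, t, z, dt):
    """Classical fourth-order Runge-Kutta step."""
    k1 = f(t, z)
    k2 = f(t + 0.5 * dt, z + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, z + 0.5 * dt * k2)
    k4 = f(t + dt, z + dt * k3)
    return z + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def solve(f, z0, t_end, n_steps, step):
    """Integrate from t=0 to t_end on a uniform grid of n_steps steps."""
    dt = t_end / n_steps
    t, z = 0.0, z0
    for _ in range(n_steps):
        z = step(f, t, z, dt)
        t += dt
    return z

def decay(t, z):
    """Toy dynamics dz/dt = -z, whose exact solution is z(t) = z0 * exp(-t)."""
    return -z

exact = math.exp(-1.0)
for n in (4, 16, 64):
    euler_err = abs(solve(decay, 1.0, 1.0, n, euler_step) - exact)
    rk4_err = abs(solve(decay, 1.0, 1.0, n, rk4_step) - exact)
    print(f"steps={n:3d}  euler_err={euler_err:.2e}  rk4_err={rk4_err:.2e}")
```

Both errors shrink as the grid gets denser, and the fourth-order method is far more accurate at every step count. Whether this carries over to a learned dynamics function is an empirical question, which is what the paper's scaling analysis examines.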