STC-ViT: Spatio Temporal Continuous Vision Transformer for Medium-range Global Weather Forecasting
Hira Saleem, Flora Salim, Cormac Purcell
TL;DR
STC-ViT introduces a Spatio-Temporal Continuous Vision Transformer that models weather dynamics as a continuous-time process by integrating a Fourier spectral operator with a Neural ODE-based transformer encoder. By fusing patch-level ViT tokens with global spectral context and spherical-harmonic positional encoding, the approach achieves competitive medium-range forecasts with a shallow transformer and fast inference (0.125 seconds for a complete forecast trajectory). The work includes extensive analyses of solvers, depth, and time-grid resolution, and demonstrates strong performance on the WeatherBench and WeatherBench 2 datasets while reducing data and compute costs. It highlights the potential of continuous-depth, topology-aware architectures for efficient, accurate weather forecasting and points to future directions in probabilistic modeling and multi-dataset generalization.
Abstract
Operational Numerical Weather Prediction (NWP) systems rely on computationally expensive physics-based models. Recently, transformer models have shown remarkable potential in weather forecasting, achieving state-of-the-art results. However, traditional transformers discretize the spatio-temporal dimensions, limiting their ability to model continuous dynamical weather processes. Moreover, their reliance on increased depth to capture complex dependencies results in higher computational cost and parameter redundancy. We address these issues with STC-ViT, a Spatio-Temporal Continuous Vision Transformer for weather forecasting. STC-ViT integrates a Fourier Neural Operator (FNO) for global spatial operators with a transformer-parameterised Neural ODE for continuous-time dynamics, yielding a space-time continuous model. Our proposed method achieves competitive forecasting performance even with a shallow, single-layer transformer encoder, mitigating the reliance on deeper networks. STC-ViT generates complete forecast trajectories with an inference time of only 0.125 seconds and achieves strong medium-range forecasting skill on 1.5-degree WeatherBench 2 compared with state-of-the-art data-driven and NWP models trained on higher-resolution data, with substantially lower data and compute costs. We also provide a detailed empirical analysis of the model's performance with respect to denser time grids, higher-accuracy ODE solvers, and deeper transformer stacks.
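
To make the combination described in the abstract more concrete, the following is a minimal NumPy sketch of the general idea: a Fourier spectral layer supplies global spatial context, and a single self-attention block serves as the vector field of an ODE that is integrated over a time grid. This is an illustration under stated assumptions, not the authors' implementation. The function names (`spectral_mix`, `attention_field`, `odeint_rk4`, `forecast`), the residual fusion of the spectral output, the fixed-step RK4 solver, the patch size, and all dimensions are hypothetical choices. Weights are random rather than trained, and the spherical-harmonic positional encoding is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_mix(x, weights, modes):
    """FNO-style layer: FFT over the grid, reweight low frequencies, inverse FFT.

    x: (H, W, C) real field; weights: (modes, modes, C) complex.
    Simplification: only one low-frequency corner of the spectrum is kept.
    """
    x_hat = np.fft.rfft2(x, axes=(0, 1))
    out_hat = np.zeros_like(x_hat)
    out_hat[:modes, :modes, :] = x_hat[:modes, :modes, :] * weights
    return np.fft.irfft2(out_hat, s=x.shape[:2], axes=(0, 1))

def attention_field(z, Wq, Wk, Wv):
    """Single-head self-attention used as the ODE right-hand side dz/dt = f(z)."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(z.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

def odeint_rk4(f, z0, t_grid):
    """Fixed-step classical Runge-Kutta; returns the state at every grid time."""
    z, states = z0, [z0]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = t1 - t0
        k1 = f(z)
        k2 = f(z + 0.5 * h * k1)
        k3 = f(z + 0.5 * h * k2)
        k4 = f(z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        states.append(z)
    return np.stack(states)

def forecast(x, params, t_grid, patch=4):
    """Spectral context -> patch tokens -> continuous-time attention dynamics."""
    H, W, C = x.shape
    # Assumption: spectral output is fused by residual addition.
    x = x + spectral_mix(x, params["spec"], params["modes"])
    # Split the grid into non-overlapping patches and flatten each into a token.
    tokens = (x.reshape(H // patch, patch, W // patch, patch, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, patch * patch * C))
    z0 = tokens @ params["embed"]
    f = lambda z: attention_field(z, params["Wq"], params["Wk"], params["Wv"])
    return odeint_rk4(f, z0, t_grid)  # (len(t_grid), num_tokens, D)

# Toy usage: a 16 x 32 grid with 3 variables, embedding width 8.
C, D, patch, modes = 3, 8, 4, 4
x = rng.normal(size=(16, 32, C))
params = {
    "modes": modes,
    "spec": rng.normal(scale=0.1, size=(modes, modes, C))
            + 1j * rng.normal(scale=0.1, size=(modes, modes, C)),
    "embed": rng.normal(scale=0.1, size=(patch * patch * C, D)),
    "Wq": rng.normal(scale=0.1, size=(D, D)),
    "Wk": rng.normal(scale=0.1, size=(D, D)),
    "Wv": rng.normal(scale=0.1, size=(D, D)),
}
t_grid = np.linspace(0.0, 1.0, 5)
traj = forecast(x, params, t_grid)
print(traj.shape)
```

In this sketch the whole trajectory comes from one ODE solve over `t_grid`, so a denser time grid or a higher-order solver changes the integration cost without adding parameters. A practical model would also need a decoder that maps the latent states back to weather variables, which is left out here.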
