W-MAE: Pre-trained weather model with masked autoencoder for multi-variable weather forecasting
Xin Man, Chenghong Zhang, Jin Feng, Changyu Li, Jie Shao
TL;DR
W-MAE is a Weather model with Masked AutoEncoder pre-training, built on a Vision Transformer to capture spatial correlations across multiple meteorological variables. The model is pre-trained on ERA5 data in a self-supervised manner, then fine-tuned for multi-variable forecasting and precipitation tasks. It maintains stable performance over short-to-medium horizons and predicts precipitation more accurately than FourCastNet. The pre-training scheme transfers readily to other task-specific models and improves the accuracy and efficiency of ensemble forecasts, with practical training-time advantages. This work highlights the value of task-agnostic pre-training for weather and climate forecasting and suggests potential extensions to longer-term forecasting.
Abstract
Weather forecasting is a long-standing computational challenge with direct societal and economic impacts. The task involves large volumes of continuously collected data and exhibits rich spatiotemporal dependencies over long periods, making it highly suitable for deep learning models. In this paper, we apply pre-training techniques to weather forecasting and propose W-MAE, a Weather model with Masked AutoEncoder pre-training. W-MAE is pre-trained in a self-supervised manner to reconstruct spatial correlations within meteorological variables. On the temporal scale, we fine-tune the pre-trained W-MAE to predict the future states of meteorological variables, thereby modeling the temporal dependencies present in weather data. We conduct our experiments using the fifth-generation ECMWF Reanalysis (ERA5) data, with samples selected every six hours. Experimental results show that our W-MAE framework offers three key benefits: 1) when predicting the future state of meteorological variables, using our pre-trained W-MAE effectively alleviates the problem of cumulative errors in prediction, maintaining stable performance in the short-to-medium term; 2) when predicting diagnostic variables (e.g., total precipitation), our model exhibits significant performance advantages over FourCastNet; 3) our task-agnostic pre-training scheme can be easily integrated with various task-specific models. When our pre-training framework is applied to FourCastNet, it yields an average 20% performance improvement in Anomaly Correlation Coefficient (ACC).
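For readers unfamiliar with the Anomaly Correlation Coefficient (ACC) cited above, the following is a minimal sketch of how the metric is commonly computed: the correlation between forecast and ground-truth anomalies, where an anomaly is the departure from a climatological mean. The function name `anomaly_correlation`, its arguments, and the optional per-point `weights` (for example, latitude weights) are illustrative assumptions; this section does not specify the exact weighting or climatology used in the experiments.

```python
import math


def anomaly_correlation(forecast, truth, climatology, weights=None):
    """Return the ACC over a flattened list of grid points.

    All arguments are equal-length sequences of floats. `weights` is an
    optional, hypothetical per-point weighting (e.g. latitude weights);
    uniform weights are assumed when it is omitted.
    """
    if weights is None:
        weights = [1.0] * len(forecast)

    # Anomalies: departure of each field from the climatological mean.
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    t_anom = [t - c for t, c in zip(truth, climatology)]

    # Weighted covariance and variances of the two anomaly fields.
    cov = sum(w * f * t for w, f, t in zip(weights, f_anom, t_anom))
    f_var = sum(w * f * f for w, f in zip(weights, f_anom))
    t_var = sum(w * t * t for w, t in zip(weights, t_anom))

    denom = math.sqrt(f_var * t_var)
    if denom == 0.0:
        raise ValueError("ACC is undefined when an anomaly field is all zeros")
    return cov / denom


if __name__ == "__main__":
    climatology = [10.0, 10.0, 10.0, 10.0]
    truth = [12.0, 9.0, 11.0, 8.0]
    # A forecast identical to the truth has perfectly correlated anomalies.
    print(anomaly_correlation(truth, truth, climatology))
```

An ACC of 1 indicates forecast anomalies that match the observed anomalies in pattern, 0 indicates no linear relationship, and negative values indicate anti-correlated anomalies.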
