Revealing the Power of Masked Autoencoders in Traffic Forecasting
Jiarui Sun, Yujie Fan, Chin-Chia Michael Yeh, Wei Zhang, Girish Chowdhary
TL;DR
STMAE tackles data scarcity and training instability in traffic forecasting with a generative self-supervised framework that pretrains existing spatial-temporal encoders under dual masking: spatial masking via biased random walks on the graph and temporal masking via patch-based masking of the input sequences. During pretraining, two lightweight decoders reconstruct the masked adjacency and the masked data, guided by the losses ${\mathcal{L}}_{\mathbf{A}}$ and ${\mathcal{L}}_{\mathcal{X}}$ and a balancing parameter $\lambda$. The pretrained encoder is then fine-tuned together with the backbone's predictor. Across four PEMS datasets and three backbones (DCRNN, AGCRN, MTGNN), STMAE consistently improves forecasting accuracy over strong baselines and a contrastive SSL method (STGCL), while keeping training stable and reducing performance degradation at longer horizons. The work shows that dual masking is an effective, plug-and-play strategy for enhancing spatial-temporal models in traffic forecasting, with practical implications for urban planning and traffic management.
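The patch-based temporal masking mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `patch_mask_temporal`, the parameters `patch_len` and `mask_ratio`, the choice to share one mask across all nodes, and zero-filling of masked steps are all hypothetical.

```python
import numpy as np


def patch_mask_temporal(x, patch_len, mask_ratio, rng):
    """Mask whole temporal patches of a traffic tensor.

    x: array of shape (T, N, C) = (time steps, nodes, channels).
    Returns (masked_x, step_mask), where step_mask has shape (T,)
    and True marks a masked time step.
    Assumption: the same patches are masked for every node, and
    masked steps are filled with zeros.
    """
    num_steps = x.shape[0]
    if num_steps % patch_len != 0:
        raise ValueError("number of time steps must be divisible by patch_len")

    # Split the time axis into non-overlapping patches and pick some to hide.
    num_patches = num_steps // patch_len
    num_masked = int(round(mask_ratio * num_patches))
    masked_patches = rng.choice(num_patches, size=num_masked, replace=False)

    patch_mask = np.zeros(num_patches, dtype=bool)
    patch_mask[masked_patches] = True

    # Expand the patch-level mask back to individual time steps.
    step_mask = np.repeat(patch_mask, patch_len)

    masked_x = x.copy()
    masked_x[step_mask] = 0.0
    return masked_x, step_mask


# Usage: 12 time steps, 4 sensors, 1 channel; hide half of the 3-step patches.
rng = np.random.default_rng(0)
x = np.ones((12, 4, 1))
masked_x, step_mask = patch_mask_temporal(x, patch_len=3, mask_ratio=0.5, rng=rng)
```

Masking contiguous patches rather than isolated time steps prevents the encoder from recovering a hidden value by simply interpolating between its immediate neighbors.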
Abstract
Traffic forecasting, crucial for urban planning, requires accurate predictions of spatial-temporal traffic patterns across urban areas. Existing research mainly focuses on designing complex models that explicitly capture spatial-temporal dependencies among variables. However, the field faces challenges of data scarcity and model stability, which limit performance improvements. To address these issues, we propose Spatial-Temporal Masked AutoEncoders (STMAE), a plug-and-play framework designed to enhance existing spatial-temporal models for traffic prediction. STMAE consists of two learning stages. In the pretraining stage, an encoder processes partially visible traffic data produced by a dual-masking strategy that combines biased random walk-based spatial masking with patch-based temporal masking. Two decoders then reconstruct the masked counterparts from the spatial and temporal perspectives. The fine-tuning stage retains the pretrained encoder and integrates it with decoders from existing backbones to improve forecasting accuracy. Our results on traffic benchmarks show that STMAE can substantially enhance the forecasting capabilities of various spatial-temporal models.
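To make the spatial side of the dual-masking strategy concrete, the sketch below masks the edges traversed by biased random walks. It is an illustration under stated assumptions rather than the authors' method: the node2vec-style bias with return parameter `p` and in-out parameter `q` is a swapped-in choice, and the function name `biased_walk_edge_mask` and the parameters `num_walks` and `walk_len` are hypothetical.

```python
import numpy as np


def biased_walk_edge_mask(adj, num_walks, walk_len, p, q, rng):
    """Mask edges visited by biased random walks on an undirected graph.

    adj: symmetric (N, N) adjacency matrix; nonzero entries are edges.
    Returns (masked_adj, edge_mask), where edge_mask is a boolean (N, N)
    array and True marks a masked edge.
    Assumption: node2vec-style second-order bias controlled by p and q.
    """
    num_nodes = adj.shape[0]
    edge_mask = np.zeros(adj.shape, dtype=bool)

    for _ in range(num_walks):
        cur = int(rng.integers(num_nodes))  # random start node
        prev = None
        for _ in range(walk_len):
            nbrs = np.flatnonzero(adj[cur])
            if nbrs.size == 0:
                break  # isolated node: the walk cannot continue
            if prev is None:
                weights = np.ones(nbrs.size)
            else:
                # 1/p: step back to prev; 1: stay near prev; 1/q: move away.
                weights = np.where(
                    nbrs == prev,
                    1.0 / p,
                    np.where(adj[prev, nbrs] > 0, 1.0, 1.0 / q),
                )
            nxt = int(rng.choice(nbrs, p=weights / weights.sum()))
            # Hide the traversed edge in both directions.
            edge_mask[cur, nxt] = True
            edge_mask[nxt, cur] = True
            prev, cur = cur, nxt

    masked_adj = np.where(edge_mask, 0, adj)
    return masked_adj, edge_mask


# Usage: a 6-node ring graph standing in for a small sensor network.
n = 6
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
rng = np.random.default_rng(0)
masked_adj, edge_mask = biased_walk_edge_mask(
    adj, num_walks=2, walk_len=3, p=1.0, q=2.0, rng=rng
)
```

In a pretraining setup of this kind, `masked_adj` would be fed to the encoder while the entries selected by `edge_mask` serve as reconstruction targets for the spatial decoder.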
