Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning
Jiewen Deng, Renhe Jiang, Jiaqi Zhang, Xuan Song
TL;DR
MoSSL targets robust multi-modality spatio-temporal (MoST) forecasting by explicitly modeling interactions and dynamic heterogeneity across space, time, and modalities. It introduces a Multi-Modality Spatio-Temporal Encoder, modality-aware data augmentation, and two self-supervised learning paradigms, Global (GSSL) and Modality (MSSL). The model is trained with a joint objective that combines an MSE forecasting loss with the SSL losses to capture latent heterogeneity. Empirical results on the NYC Traffic Demand and BJ Air Quality datasets show MoSSL achieving state-of-the-art accuracy across horizons and modalities, with interpretable insights into heterogeneous components and node-level cross-modality correlations. The approach offers a practical, scalable framework for MoST forecasting, with applications in smart cities and environmental monitoring.
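The joint objective mentioned above can be illustrated with a minimal sketch. It assumes the two SSL terms are added to the forecasting MSE with a single weight; the summary does not state the weighting scheme, so `ssl_weight`, `joint_loss`, and the toy loss values are hypothetical and are not taken from the MoSSL implementation.

```python
def mse(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(pred, target, gssl_loss, mssl_loss, ssl_weight=1.0):
    """Forecasting MSE plus a weighted sum of the two SSL terms.

    ssl_weight is a hypothetical hyperparameter: the source only says the
    MSE and SSL losses are optimized jointly, not how they are weighted.
    """
    return mse(pred, target) + ssl_weight * (gssl_loss + mssl_loss)


# Toy values only; real GSSL/MSSL losses would come from the SSL heads.
total = joint_loss([1.0, 2.0], [1.0, 4.0], gssl_loss=0.5, mssl_loss=0.25)
print(total)
```

In this sketch the SSL terms act as auxiliary regularizers on top of the forecasting loss; the actual formulation may weight GSSL and MSSL separately.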
Abstract
Multi-modality spatio-temporal (MoST) data extends spatio-temporal (ST) data by incorporating multiple modalities. Such data is prevalent in monitoring systems and covers, for example, diverse traffic demands and air quality assessments. Despite significant strides in ST modeling in recent years, the potential of information from different modalities remains underexploited. Robust MoST forecasting is more challenging than ST forecasting because MoST data exhibits (i) high-dimensional and complex internal structures and (ii) dynamic heterogeneity caused by temporal, spatial, and modality variations. In this study, we propose a novel MoST learning framework via Self-Supervised Learning, namely MoSSL, which aims to uncover latent patterns from temporal, spatial, and modality perspectives while quantifying dynamic heterogeneity. Experimental results on two real-world MoST datasets verify the superiority of our approach over state-of-the-art baselines. The model implementation is available at https://github.com/beginner-sketch/MoSSL.
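To make the relationship between ST and MoST data concrete, the following sketch lays out MoST data as a three-axis array indexed by time step, node, and modality. The axis order, the toy sizes, and the helper `modality_slice` are assumptions chosen for illustration; the MoSSL repository may store the data differently.

```python
# Toy sizes (hypothetical): time steps, nodes (sensors or regions), modalities.
T, N, M = 4, 3, 2

# most[t][n][m] is the reading at time t, node n, modality m.
# The values are synthetic and encode their own indices for easy checking.
most = [[[float(t * 100 + n * 10 + m) for m in range(M)]
         for n in range(N)]
        for t in range(T)]


def modality_slice(data, m):
    """Return the (T, N) spatio-temporal series for a single modality m."""
    return [[node[m] for node in step] for step in data]


# Fixing one modality recovers ordinary single-modality ST data.
st_single = modality_slice(most, 1)
shape = (len(st_single), len(st_single[0]))
print(shape)
```

Fixing the modality index yields a plain time-by-node ST series, which is why MoST data can be viewed as ST data extended along one additional axis.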
