Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning

Jiewen Deng, Renhe Jiang, Jiaqi Zhang, Xuan Song

TL;DR

MoSSL targets robust multi-modality spatio-temporal (MoST) forecasting by explicitly modeling interactions and dynamic heterogeneity across space, time, and modalities. It introduces a Multi-Modality Spatio-Temporal Encoder, modality-aware data augmentation, and two self-supervised learning paradigms, Global (GSSL) and Modality (MSSL). These are trained with a jointly optimized objective that combines an MSE forecasting loss with the SSL losses to capture latent heterogeneity. On the NYC Traffic Demand and BJ Air Quality datasets, MoSSL achieves state-of-the-art accuracy across horizons and modalities, and it yields interpretable insights into heterogeneous components and node-level cross-modality correlations. The framework is aimed at MoST forecasting applications such as smart cities and environmental monitoring.
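The jointly optimized objective mentioned above can be illustrated with a minimal sketch. It assumes the two SSL losses are already computed as scalars and are simply added to the MSE forecasting loss with scalar weights; the weighting scheme and the names `joint_loss`, `loss_g`, `loss_c`, `w_g`, and `w_c` are illustrative assumptions, not taken from the paper or its repository.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred: Sequence[float],
    target: Sequence[float],
    loss_g: float,  # scalar global SSL loss (assumed precomputed)
    loss_c: float,  # scalar modality SSL loss (assumed precomputed)
    w_g: float = 1.0,  # hypothetical weight on the global SSL term
    w_c: float = 1.0,  # hypothetical weight on the modality SSL term
) -> float:
    """Forecasting MSE plus weighted SSL terms (an assumed combination)."""
    return mse(pred, target) + w_g * loss_g + w_c * loss_c


# Example: MSE of 2.0 plus SSL terms of 0.5 and 0.25.
total = joint_loss([1.0, 2.0], [1.0, 4.0], loss_g=0.5, loss_c=0.25)
```

Setting `w_g` and `w_c` to zero reduces the sketch to plain MSE training, which is one way to see what the self-supervised terms contribute.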

Abstract

Multi-modality spatio-temporal (MoST) data extends spatio-temporal (ST) data by incorporating multiple modalities and is prevalent in monitoring systems, covering diverse traffic demands and air quality assessments. Despite significant strides in ST modeling in recent years, more emphasis is needed on harnessing the information from different modalities. Robust MoST forecasting is more challenging because MoST data possesses (i) high-dimensional and complex internal structures and (ii) dynamic heterogeneity caused by temporal, spatial, and modality variations. In this study, we propose a novel MoST learning framework via Self-Supervised Learning, namely MoSSL, which aims to uncover latent patterns from temporal, spatial, and modality perspectives while quantifying dynamic heterogeneity. Experimental results on two real-world MoST datasets verify the superiority of our approach over state-of-the-art baselines. The model implementation is available at https://github.com/beginner-sketch/MoSSL.

Paper Structure

This paper contains 16 sections, 13 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Illustration of heterogeneous NYC traffic demand data. Heterogeneity exists in the modalities (Bike Inflow and Taxi Inflow), locations (Food and Residence), and time periods (8 am and 7 pm).
  • Figure 2: The proposed MoSSL framework: (i) takes a three-dimensional MoST input $X$ and generates the original representation $H$ through a multi-layered MoST Encoder; (ii) simultaneously, Multi-modality Data Augmentation refines $X$ into $\tilde{X}$, which is fed into the shared MoST Encoder to produce the augmented representation $\tilde{H}$; (iii) with $H$ and $\tilde{H}$ available, deploys Global Self-Supervised Learning (GSSL) and Modality Self-Supervised Learning (MSSL) to generate the losses $\mathcal{L}_g$ and $\mathcal{L}_c$.
  • Figure 3: Efficiency study on the NYC Traffic Demand dataset.
  • Figure 4: Ablation study of MoSSL.
  • Figure 5: Case studies on modality-aware augmentation.
  • ...and 3 more figures