Causal Spatio-Temporal Prediction: An Effective and Efficient Multi-Modal Approach

Yuting Huang, Ziquan Fang, Zhihao Zeng, Lu Chen, Yunjun Gao

TL;DR

E$^2$-CSTP tackles multi-modal spatio-temporal forecasting by integrating cross-modal attention with adaptive gating, and by disentangling causal influences through a dual-branch intervention framework. It pairs a DeepSHAP-informed causal graph and backdoor adjustment with a hybrid GCN–Mamba STED encoder to deliver scalable, accurate predictions while reducing computational overhead relative to Transformer baselines. Across four real-world datasets, the approach achieves accuracy gains of up to 9.66% and reductions in computational overhead of 17.37% to 56.11%, with ablations confirming the contribution of each component. The work targets robust, efficient multi-modal forecasting under confounding, with applications in transportation, climate, and urban planning.

Abstract

Spatio-temporal prediction plays a crucial role in intelligent transportation, weather forecasting, and urban planning. While integrating multi-modal data has shown potential for enhancing prediction accuracy, key challenges persist: (i) inadequate fusion of multi-modal information, (ii) confounding factors that obscure causal relations, and (iii) high computational complexity of prediction models. To address these challenges, we propose E$^2$-CSTP, an Effective and Efficient Causal multi-modal Spatio-Temporal Prediction framework. E$^2$-CSTP leverages cross-modal attention and gating mechanisms to effectively integrate multi-modal data. Building on this, we design a dual-branch causal inference approach: the primary branch focuses on spatio-temporal prediction, while the auxiliary branch mitigates bias by modeling additional modalities and applying causal interventions to uncover true causal dependencies. To improve model efficiency, we integrate GCN with the Mamba architecture for accelerated spatio-temporal encoding. Extensive experiments on 4 real-world datasets show that E$^2$-CSTP significantly outperforms 9 state-of-the-art methods, achieving up to 9.66% improvements in accuracy as well as 17.37%-56.11% reductions in computational overhead.
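
The causal intervention idea can be illustrated with a toy backdoor adjustment. The sketch below is not the paper's implementation, which presumably operates on learned multi-modal representations; it only shows the standard identity P(Y | do(X)) = sum_z P(Y | X, z) P(z) on a small discrete example. The joint table, the variable names, and the assumption that a single observed binary confounder Z satisfies the backdoor criterion are all hypothetical and chosen for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(Z, X, Y) over binary variables,
# indexed as joint[z, x, y]. The numbers are invented for illustration.
joint = np.array([
    [[0.28, 0.12],    # z=0, x=0: P(y=0), P(y=1)
     [0.03, 0.07]],   # z=0, x=1
    [[0.05, 0.05],    # z=1, x=0
     [0.08, 0.32]],   # z=1, x=1
])
assert np.isclose(joint.sum(), 1.0)


def naive_conditional(joint, x):
    """P(Y=1 | X=x): mixes the causal effect of X with confounding by Z."""
    p_x = joint[:, x, :].sum()
    return float(joint[:, x, 1].sum() / p_x)


def backdoor_adjusted(joint, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z).

    Assumes Z is observed and blocks every backdoor path from X to Y.
    """
    p_z = joint.sum(axis=(1, 2))                                  # P(Z=z)
    p_y_given_xz = joint[:, x, 1] / joint[:, x, :].sum(axis=1)    # P(Y=1 | x, z)
    return float(np.sum(p_y_given_xz * p_z))


naive_effect = naive_conditional(joint, 1) - naive_conditional(joint, 0)
causal_effect = backdoor_adjusted(joint, 1) - backdoor_adjusted(joint, 0)
print(f"naive difference:    {naive_effect:.2f}")
print(f"adjusted difference: {causal_effect:.2f}")
```

In this toy table Z raises both the chance of X=1 and the chance of Y=1, so the naive conditional difference overstates the effect of X; weighting each stratum by P(Z=z) instead of P(Z=z | X=x) removes that bias.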

Paper Structure

This paper contains 31 sections, 31 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Multi-modal and confounded relations.
  • Figure 2: The overall framework of E$^2$-CSTP.
  • Figure 3: The ablation study.
  • Figure 4: Model efficiency comparison on the total training time.
  • Figure 5: Efficiency under prediction variants on Terra.
  • ...and 4 more figures