STDCformer: A Transformer-Based Model with a Spatial-Temporal Causal De-Confounding Strategy for Crowd Flow Prediction

Silu He, Peng Shen, Pingzhen Xu, Qinyao Luo, Haifeng Li

TL;DR

This work addresses crowd flow prediction by reframing the problem as learning a de-confounded spatial-temporal latent space and a cross-time mapping within that space. The STDCformer architecture implements a Spatial-Temporal De-Confounded (STDC) representation space via backdoor-adjusted attention, a Spatial-Temporal Embedding (STE) for confounder-rich token representations, and a Cross-Time Attention (CTA) module that maps past states to future states. It jointly trains an STDC Encoder, a CTA-based Past-to-Future Mapping, and an STDC Decoder in a Transformer backbone, achieving state-of-the-art performance on in-distribution (IID) tests and improved zero-shot generalization to out-of-distribution regions. The approach provides interpretable weights for spatial and temporal confounders and offers a practical pipeline for deploying de-confounded predictions in dynamic urban environments, with publicly released datasets and baselines for future research.
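To make the idea of mapping past states to future states through attention concrete, here is a minimal sketch of generic scaled dot-product cross-attention in which each future time step forms a query over all past time steps. It is not the paper's CTA module, whose details are not given in this summary. The function names, the toy vectors, and the choice to reuse past representations as both keys and values are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_time_attention(future_queries, past_keys, past_values):
    """Generic cross-attention: every future step attends over all past steps.

    Hypothetical helper for illustration only, not the paper's CTA module.
    Returns the attended future representations and the attention weights.
    """
    dim = len(past_keys[0])
    outputs, weights = [], []
    for query in future_queries:
        # Scaled dot-product similarity between this future step and each past step.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in past_keys
        ]
        attn = softmax(scores)
        # Weighted average of past values gives the future representation.
        out = [
            sum(w * value[c] for w, value in zip(attn, past_values))
            for c in range(len(past_values[0]))
        ]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


# Toy example: three past steps and two future steps, 2-dimensional representations.
past = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
future_queries = [[1.0, 0.0], [0.0, 2.0]]  # assumed stand-ins for learned future queries
future, attn = cross_time_attention(future_queries, past, past)
```

Each row of `attn` sums to 1 and shows how strongly a future step draws on each past step, which is the kind of weight one would inspect when interpreting such a mapping.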

Abstract

Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ that transforms historical observations into future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations; (2) Cross-Time Mapping ($M$): transforming past representations into future representations; and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\{E, D\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from past states to future states within the representation space. This leads to two key questions. Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within that representation space? To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounded causal effect of historical data on future data. The captured causal relationship serves as the foundation for the subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses information about temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which computes attention between the future and the past to guide the spatial-temporal mapping.
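For reference, the standard backdoor adjustment from causal inference, on which a strategy of this kind builds, can be written as below. Reading $X$ as historical flow, $Y$ as future flow, and $C$ as a discrete set of spatial and temporal confounders satisfying the backdoor criterion is an assumption made here for illustration. The paper's exact estimator is not given in this summary and may differ.

```latex
% Observational conditional: confounders are weighted by P(c | x),
% so the association between X and Y is contaminated by C.
P(Y \mid X = x) = \sum_{c} P(Y \mid X = x,\, C = c)\, P(C = c \mid X = x)

% Backdoor adjustment: confounders are weighted by their marginal P(c),
% which removes the dependence of C on X.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) = \sum_{c} P(Y \mid X = x,\, C = c)\, P(C = c)
```

The two expressions differ only in how the confounder is weighted: $P(C = c \mid X = x)$ in the observational case versus $P(C = c)$ after adjustment. They coincide when $C$ is independent of $X$.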

Paper Structure

This paper contains 37 sections, 8 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: The characteristics of Spatial-Temporal Tokens in crowd flow prediction task.
  • Figure 2: Difference between historical and future flow under different time and region sampling.
  • Figure 3: The fusion of STDC strategy and ST Transformer framework. (a) ST Transformer framework. (b) ST Transformer with STDC Strategy.
  • Figure 4: The framework of STGNNs.
  • Figure 5: The formulation of crowd flow prediction. (a) Existing perspective. (b) A novel decomposing perspective proposed in this paper.
  • ...and 16 more figures