STDCformer: A Transformer-Based Model with a Spatial-Temporal Causal De-Confounding Strategy for Crowd Flow Prediction
Silu He, Peng Shen, Pingzhen Xu, Qinyao Luo, Haifeng Li
TL;DR
This work addresses crowd flow prediction by reframing the problem as learning a de-confounded spatial-temporal latent space and a cross-time mapping within that space. The STDCformer architecture combines three components: a Spatial-Temporal De-Confounded (STDC) representation space learned via backdoor-adjusted attention, a Spatial-Temporal Embedding (STE) that produces confounder-rich token representations, and a Cross-Time Attention (CTA) module that maps past states to future states. It jointly trains an STDC Encoder, a CTA-based Past-to-Future Mapping, and an STDC Decoder on a Transformer backbone, achieving state-of-the-art performance on IID tests and improved zero-shot generalization to out-of-distribution regions. The approach provides interpretable weights for spatial and temporal confounders and offers a practical pipeline for deploying de-confounded predictions in dynamic urban environments, with datasets and baselines made publicly available for future research.
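For orientation, the standard backdoor adjustment from causal inference, on which a backdoor-adjusted strategy builds, can be written as below. The symbols $X$ (historical observations), $Y$ (future observations), and $C$ (spatial-temporal confounders) are notation assumed here for illustration; the formulation used inside STDCformer may differ in detail.

$$P\left(Y \mid do(X)\right) = \sum_{c} P\left(Y \mid X, C = c\right)\, P\left(C = c\right)$$

Summing over the confounder values $c$, weighted by their marginal probability $P(C = c)$ rather than by $P(C = c \mid X)$, is what removes the spurious dependence between $X$ and $Y$ that is induced by $C$.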
Abstract
Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ that transforms historical observations into future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations, (2) Cross-Time Mapping ($M$): transforming past representations into future representations, and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\left\{{E},{D}\right\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from past states to future states within the representation space. This leads to two key questions: \textbf{Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within the representation space?} To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounded causal effect of historical data on future data. The captured causal relationship serves as the foundation for the subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses the information of temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which queries the attention between the future and the past to guide the spatial-temporal mapping.
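The decomposition into encoding, cross-time mapping, and decoding can be illustrated with a minimal NumPy sketch. This is not the STDCformer implementation: the linear encoder and decoder (`W_enc`, `W_dec`), the randomly initialised `future_queries`, and all shapes are hypothetical placeholders, and no de-confounding or spatial-temporal embedding step is included. It only shows the order of composition (apply $E$, then $M$, then $D$) with $M$ realised as scaled dot-product attention in which future queries attend over past representations.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_time_attention(future_queries, past_repr):
    """Future queries attend over past representations.

    future_queries: (T_future, d), past_repr: (T_past, d).
    Returns the future representations (T_future, d) and the
    attention weights (T_future, T_past).
    """
    d = past_repr.shape[-1]
    scores = future_queries @ past_repr.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ past_repr, weights


def predict(x_past, W_enc, W_dec, future_queries):
    # E: observation space -> representation space (placeholder linear map).
    h_past = x_past @ W_enc
    # M: past representations -> future representations.
    h_future, weights = cross_time_attention(future_queries, h_past)
    # D: representation space -> observation space (placeholder linear map).
    y_future = h_future @ W_dec
    return y_future, weights


# Hypothetical sizes: 12 past steps, 3 future steps, 4 features, hidden size 8.
rng = np.random.default_rng(0)
T_past, T_future, n_features, d = 12, 3, 4, 8
x_past = rng.normal(size=(T_past, n_features))
W_enc = rng.normal(size=(n_features, d))
W_dec = rng.normal(size=(d, n_features))
future_queries = rng.normal(size=(T_future, d))

y_future, weights = predict(x_past, W_enc, W_dec, future_queries)
```

Each row of `weights` is a distribution over the past time steps, so it can be read as how strongly a given future step draws on each past step. In a trained model the queries would come from learned embeddings of the future time steps rather than from random noise.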
