Multimodal joint prediction of traffic spatial-temporal data with graph sparse attention mechanism and bidirectional temporal convolutional network
Dongran Zhang, Jiangnan Yan, Kemal Polat, Adi Alhudhaif, Jun Li
TL;DR
This work addresses short-term joint prediction across multiple traffic modes by introducing GSABT, which couples a Graph Sparse Attention mechanism with a Bidirectional Temporal Convolutional Network (BiTCN). Spatial features are learned in two stages: a multimodal graph attention module captures local relations, and a Top-U sparse attention mechanism captures global interactions, which together couple the modes spatially. Temporal features are extracted by a Shared BiTCN (inter-modal) and modality-specific Unique BiTCN components (intra-modal), and the resulting representations feed an MLP predictor. Experiments on three real-world datasets show that GSABT achieves state-of-the-art performance across several joint-prediction settings. The framework is designed to be extensible along both the spatial and temporal dimensions.
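To make the Top-U idea concrete, here is a minimal NumPy sketch. It assumes the mechanism follows the ProbSparse-style query selection popularized by Informer: each query is scored by how peaked its attention row is (maximum minus mean), the U most active queries attend normally, and the remaining queries fall back to the mean of the values. The summary does not confirm this exact formulation, so the function name `top_u_sparse_attention`, the sparsity measure, and the fallback rule are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_u_sparse_attention(Q, K, V, u):
    """Hypothetical Top-U sparse attention over N graph nodes.

    Q, K: (N, d) queries and keys; V: (N, d_v) values; u: number of queries kept.
    Returns the (N, d_v) output and the indices of the selected queries.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # (N, N) scaled dot products
    # Assumed sparsity measure: a peaked row (max far above mean) is "active".
    sparsity = scores.max(axis=1) - scores.mean(axis=1)  # (N,)
    active = np.argsort(-sparsity)[: min(u, n)]          # indices of the top-U queries
    # Inactive queries receive the mean value vector (assumed fallback).
    out = np.tile(V.mean(axis=0), (n, 1))
    # Active queries receive ordinary softmax attention.
    out[active] = softmax(scores[active], axis=-1) @ V
    return out, active

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out, active = top_u_sparse_attention(Q, K, V, u=2)
```

For clarity this sketch computes the full score matrix, so it saves no computation; a practical variant would estimate the sparsity measure from a sample of keys.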
Abstract
Traffic flow prediction plays a crucial role in the management and operation of urban transportation systems. While prediction for individual transportation modes has been studied extensively, joint prediction across different modes has received comparatively little attention. Furthermore, existing multimodal joint modeling methods often lack flexibility in spatial-temporal feature extraction. To address these issues, we propose the Graph Sparse Attention Mechanism with Bidirectional Temporal Convolutional Network (GSABT) for multimodal traffic spatial-temporal joint prediction. First, we multiply a multimodal graph by self-attention weights to capture local spatial features, and then employ the Top-U sparse attention mechanism to obtain global spatial features. Second, we use a bidirectional temporal convolutional network to strengthen the temporal feature correlation between the output and input data, and extract inter-modal and intra-modal temporal features through a share-unique module. Finally, we design a multimodal joint prediction framework that can be flexibly extended along both the spatial and temporal dimensions. Extensive experiments on three real-world datasets indicate that the proposed model consistently achieves state-of-the-art predictive performance.
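The bidirectional temporal convolution and the share-unique module can be illustrated with a small NumPy sketch. It assumes that "bidirectional" means a causal dilated convolution applied once to the sequence and once to its time-reversed copy, with the two results summed, and that the share-unique module concatenates the output of one BiTCN layer shared by all modes with the output of a per-mode layer. The layer layout, the `tanh` activation, the function names, and the mode names `taxi` and `bike` are all hypothetical choices, not details taken from the paper.

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Causal dilated 1-D convolution.

    x: (T, C_in) sequence; w: (k, C_in, C_out) kernel.
    Output (T, C_out); the output at step t depends only on x[:t + 1].
    """
    k, T = w.shape[0], x.shape[0]
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)  # left padding only
    out = np.zeros((T, w.shape[2]))
    for i in range(k):
        # Tap i looks (k - 1 - i) * dilation steps into the past.
        out += xp[i * dilation : i * dilation + T] @ w[i]
    return out

def bitcn_layer(x, w_fwd, w_bwd, dilation=1):
    # Forward pass sees the past; the pass over the reversed sequence sees the future.
    fwd = causal_conv1d(x, w_fwd, dilation)
    bwd = causal_conv1d(x[::-1], w_bwd, dilation)[::-1]
    return np.tanh(fwd + bwd)

def share_unique(xs, shared, unique):
    """xs: {mode: (T, C)}; shared: (w_fwd, w_bwd); unique: {mode: (w_fwd, w_bwd)}.

    Returns {mode: (T, 2 * C_out)}: shared (inter-modal) features followed by
    mode-specific (intra-modal) features.
    """
    return {
        m: np.concatenate([bitcn_layer(x, *shared), bitcn_layer(x, *unique[m])], axis=1)
        for m, x in xs.items()
    }

rng = np.random.default_rng(0)
T, C, H = 12, 3, 5  # time steps, input channels, hidden channels

def make_weights():
    return tuple(0.1 * rng.normal(size=(2, C, H)) for _ in range(2))

xs = {"taxi": rng.normal(size=(T, C)), "bike": rng.normal(size=(T, C))}
features = share_unique(xs, make_weights(), {m: make_weights() for m in xs})
```

In this sketch the shared weights are reused for every mode while each mode also keeps its own weights, which is one simple way to separate inter-modal from intra-modal temporal features before a downstream predictor.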
