xMTrans: Temporal Attentive Cross-Modality Fusion Transformer for Long-Term Traffic Prediction

Huy Quang Ung; Hao Niu; Minh-Son Dao; Shinya Wada; Atsunori Minamikawa

TL;DR

The paper tackles long-term traffic prediction (LTTP) by using multi-modal data to capture temporal correlations between a target modality and a mobility-based support modality (SM). It introduces xMTrans, a temporal attentive cross-modality transformer that combines an attention-based temporal embedding, cross-modality fusion layers built from masked multi-head self-attention and masked multi-head temporal attention modules, and a multi-resolution recursive training strategy. Experiments on traffic congestion length and taxi-demand datasets show that xMTrans outperforms recent state-of-the-art methods, and ablations confirm the importance of the SM and the temporal attention components. The approach offers a practical framework for integrating mobility and environmental cues into LTTP, with potential benefits for urban traffic management and planning.

Abstract

Traffic prediction plays a crucial role in intelligent transportation systems. The rapid development of IoT devices makes it possible to collect many kinds of data that correlate strongly with traffic, fostering the development of efficient multi-modal traffic prediction models. To date, however, few studies have focused on exploiting multi-modal data for traffic prediction. In this paper, we introduce xMTrans, a novel temporal attentive cross-modality transformer model for long-term traffic prediction that explores the temporal correlations between the data of two modalities: one target modality (the quantity to predict, e.g., traffic congestion) and one support modality (e.g., people flow). We conducted extensive experiments to evaluate the proposed model on traffic congestion and taxi demand prediction using real-world datasets. The results show that xMTrans outperforms recent state-of-the-art methods on long-term traffic prediction. We also conducted a comprehensive ablation study to analyze the effectiveness of each module in xMTrans.
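To make the idea of attending from a target modality to a support modality more concrete, the following is a minimal single-head sketch in NumPy. It is not the authors' implementation: the function name `masked_cross_attention`, the causal (lower-triangular) mask, and the absence of learned projections and multiple heads are all assumptions made for illustration.

```python
import numpy as np

def masked_cross_attention(target, support, causal=True):
    """Illustrative single-head attention (hypothetical helper, not from the paper).

    target:  (T, d) embeddings of the target modality, e.g. congestion length
    support: (T, d) embeddings of the support modality, e.g. people flow
    Returns the fused (T, d) representation and the (T, T) attention weights.
    """
    T, d = target.shape
    # Queries come from the target modality, keys/values from the support modality.
    scores = target @ support.T / np.sqrt(d)
    if causal:
        # Assumed mask: time step i may only attend to support steps j <= i.
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ support, weights

rng = np.random.default_rng(0)
congestion = rng.normal(size=(6, 4))   # toy target-modality sequence
people_flow = rng.normal(size=(6, 4))  # toy support-modality sequence
fused, attn = masked_cross_attention(congestion, people_flow)
```

Each row of `attn` sums to one and has zeros above the diagonal, so under the assumed mask no time step draws on future support-modality values.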

Paper Structure (22 sections, 2 equations, 4 figures, 5 tables)

Introduction
Related Work
Problem Formulation
Proposed Method
Attention-based Temporal Embedding
Cross-Modality Fusion Layer
Masked Multi-head Self-attention Module
Masked Multi-head Temporal Attention Module
Multi-Resolution Recursive Training Strategy
Temporal Multi-Resolutions
Spatial Multi-Resolutions
Evaluations
Dataset
Traffic Congestion Length Dataset (TCL)
People-Flow Dataset (PF)
...and 7 more sections

Figures (4)

Figure 1: Overview of our proposed architecture.
Figure 2: Illustration of temporal multi-resolution training strategy and spatial multi-resolution formulation.
Figure 3: Examples of predictions from our xMTrans, its uni-modal version, and the highlighted baselines. The predictions start from $(t=48)$. $(t=0)$ is the time step at 0:00, while $(t=48)$ and $(t=71)$ correspond to 0:00 and 11:45 of the following day, respectively.
Figure 4: Averages of the eight attention maps from the masked multi-head attention module in the last cross-modality fusion layer.
