Rethinking Spatio-Temporal Transformer for Traffic Prediction: Multi-level Multi-view Augmented Learning Framework

Jiaqi Lin; Qianqian Ren

TL;DR

The paper tackles the challenge of accurate traffic prediction by modeling rich spatio-temporal dependencies. It introduces LVSTformer, a multi-level, multi-view augmented spatio-temporal transformer that combines a spatio-temporal embedding layer, three parallel spatial attention views (local, global, pivotal), gated temporal self-attention, and a spatio-temporal context broadcasting mechanism. Empirical results on six real-world traffic benchmarks show state-of-the-art performance with notable improvements in MAE, along with comprehensive ablations, long-term forecasts, and cost-efficiency analysis. The work advances traffic forecasting by enhancing multi-scale spatial modeling, temporal dynamics, and generalization through balanced attention, with practical impact for ITS applications.

Abstract

Traffic prediction is a challenging spatio-temporal forecasting problem that involves highly complex spatio-temporal correlations. This paper proposes a Multi-level Multi-view Augmented Spatio-temporal Transformer (LVSTformer) for traffic prediction. The model aims to capture spatial dependencies at three different levels (local geographic, global semantic, and pivotal nodes), along with long- and short-term temporal dependencies. Specifically, we design three spatial augmented views to explore the spatial information from the perspectives of local, global, and pivotal nodes. By combining the three spatial augmented views with three parallel spatial self-attention mechanisms, the model can comprehensively capture spatial dependencies at different levels. We design a gated temporal self-attention mechanism to effectively capture long- and short-term temporal dependencies. Furthermore, a spatio-temporal context broadcasting module is introduced between two spatio-temporal layers to ensure a well-distributed allocation of attention scores, alleviating overfitting and information loss and enhancing the generalization ability and robustness of the model. A comprehensive set of experiments is conducted on six well-known traffic benchmarks. The experimental results demonstrate that LVSTformer achieves state-of-the-art performance compared to competing baselines, with a maximum improvement of 4.32%.
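To make the idea of three parallel spatial views more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each view can be expressed as a boolean mask over node pairs that restricts which nodes may attend to each other. The function names, the toy masks, the omission of learned query/key/value projections, and the averaging step used to fuse the three outputs are all illustrative assumptions; the abstract does not specify how the views are built or fused.

```python
import math


def masked_self_attention(x, mask):
    """Single-head self-attention over nodes, restricted by a boolean view mask.

    x:    list of N feature vectors; used directly as queries, keys and values
          (learned projections are omitted to keep the sketch short).
    mask: N x N booleans; mask[i][j] is True if node i may attend to node j.
          Assumption: every row has at least one True entry (e.g. self-loops).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Scaled dot-product scores; disallowed pairs get -inf (zero weight).
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j in range(n)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(weights[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out


def multi_view_spatial_attention(x, views):
    """Run one masked attention per view and average the results.

    Averaging is a placeholder fusion rule chosen for this sketch only.
    """
    outputs = [masked_self_attention(x, m) for m in views.values()]
    n, d = len(x), len(x[0])
    return [
        [sum(o[i][k] for o in outputs) / len(outputs) for k in range(d)]
        for i in range(n)
    ]


# Toy example: 4 sensors with 2-dimensional features (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
n = len(x)

# Local view: sensors adjacent along a road segment (chain), plus self-loops.
local = [[abs(i - j) <= 1 for j in range(n)] for i in range(n)]

# Global view: sensors in the same hypothetical "semantic" group.
group = [0, 1, 1, 0]
global_ = [[group[i] == group[j] for j in range(n)] for i in range(n)]

# Pivotal view: every sensor attends to itself and to the pivotal sensor 2.
pivots = {2}
pivotal = [[i == j or j in pivots for j in range(n)] for i in range(n)]

views = {"local": local, "global": global_, "pivotal": pivotal}
fused = multi_view_spatial_attention(x, views)
```

Each row of `fused` is a convex combination of the input feature vectors, so masking changes only which neighbours contribute to a node's representation, not the scale of the features.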

Paper Structure (37 sections, 26 equations, 9 figures, 5 tables)

Introduction
Related Work
Traffic Prediction
Spatio-Temporal Graph Neural Networks (STGNNs)
Attention Mechanism and Transformer
Miscellaneous
Problem Formulation
Methodology
Spatio-Temporal Embedding Layer
Raw Data Embedding
Temporal Periodic Embedding
Temporal Position Encoding
Spatial Graph Laplacian Embedding
Data Embedding Output
Multi-view Generation
...and 22 more sections

Figures (9)

Figure 1: Performance comparisons with respect to MAE on six traffic datasets. Our LVSTformer achieves the best performance.
Figure 2: Panel (a) illustrates the region division and sensor deployment, while panels (b) and (c) demonstrate the local and global spatial dependencies in traffic data, respectively. Panel (d) counts the input and output flows of each node, and panel (e) showcases the periodicity of traffic flow.
Figure 3: The architecture of the LVSTformer: (a) The Embedding Layer aggregates raw traffic data, temporal periodic features, and spatial features to model the spatio-temporal features of traffic data. (b) The Multi-level Spatio-Temporal Transformer captures temporal dependencies through gated self-attention and spatial dependencies through spatial self-attention, which consists of three modules: local geographic self-attention (LGSA), global semantic self-attention (GSSA), and pivotal nodes self-attention (PNSA). (c) Multi-view Generation constructs the local view, global view, and pivotal view, which are integrated with spatial self-attention. (d) The details of LGSA, GSSA, and PNSA, which share the same architecture.
Figure 4: The structure of STCB.
Figure 5: Comparison results of different methods for multi-step prediction on the PeMS-BAY and PeMS08 datasets.
...and 4 more figures
