Linear Attention is Enough in Spatial-Temporal Forecasting

Xinyu Ning

TL;DR

This work reframes traffic forecasting by treating each sensor at every time step as an independent ST-Token and applying a vanilla Transformer, achieving state-of-the-art results without relying on fixed road graphs. To scale to larger networks, it introduces NSTformer, which uses Nyström-based linear attention to reduce complexity from $O(N^2T^2)$ to $O(NT)$ while maintaining competitive performance. The STformer demonstrates strong empirical gains on METR-LA and PEMS-BAY, and the NSTformer often matches or slightly exceeds STformer, suggesting approximate attention can offer regularization benefits. Overall, the study shows that pure attention is powerful for spatial-temporal forecasting and provides a scalable, graph-free approach for forecasting in dynamic road networks.
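The ST-Token view and the complexity claim above can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the authors' code: the function names `to_st_tokens` and `attention_score_entries` are hypothetical, and the example sizes (207 sensors, 12 input steps, 64 landmarks) are assumed values that do not come from the text above.

```python
import numpy as np


def to_st_tokens(x):
    """Flatten a (T, N, C) window into N*T ST-Tokens of width C.

    Row t * N + n of the result is the reading of sensor n at step t,
    so every (sensor, time step) pair becomes one token.
    """
    num_steps, num_nodes, channels = x.shape
    return x.reshape(num_steps * num_nodes, channels)


def attention_score_entries(num_nodes, num_steps, num_landmarks=None):
    """Count attention-score entries that must be computed.

    Full self-attention scores every token pair: (N*T)^2 entries.
    A landmark-based (Nystrom-style) approximation with m landmarks
    needs two (N*T) x m blocks and one m x m block, which is linear
    in N*T when m is held fixed.
    """
    seq_len = num_nodes * num_steps
    if num_landmarks is None:
        return seq_len * seq_len
    return 2 * seq_len * num_landmarks + num_landmarks ** 2


if __name__ == "__main__":
    # Assumed sizes for illustration only.
    window = np.zeros((12, 207, 1))
    tokens = to_st_tokens(window)
    print(tokens.shape)
    print(attention_score_entries(207, 12))
    print(attention_score_entries(207, 12, num_landmarks=64))
```

With the number of landmarks fixed, the second count grows in proportion to the number of tokens, which is the sense in which the cost drops from $O(N^2T^2)$ to $O(NT)$.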

Abstract

As the most representative spatial-temporal forecasting task, traffic forecasting has attracted considerable attention from the machine learning community because of its intricate correlations in both the spatial and temporal dimensions. Existing methods often treat road networks over time as spatial-temporal graphs and address spatial and temporal representations independently. However, these approaches struggle to capture the dynamic topology of road networks, encounter issues with message-passing mechanisms and over-smoothing, and face challenges in learning spatial and temporal relationships separately. To address these limitations, we propose treating nodes in road networks at different time steps as independent spatial-temporal tokens and feeding them into a vanilla Transformer to learn complex spatial-temporal patterns. The resulting model, STformer, achieves state-of-the-art performance. Because STformer has quadratic complexity, we introduce a variant, NSTformer, which uses the Nyström method to approximate self-attention with linear complexity and, surprisingly, even slightly outperforms STformer in a few cases. Extensive experimental results on traffic datasets demonstrate that the proposed method achieves state-of-the-art performance at an affordable computational cost. Our code is available at https://github.com/XinyuNing/STformer-and-NSTformer.
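To illustrate how a Nyström-style approximation avoids forming the full attention matrix, here is a minimal NumPy sketch. It follows the general landmark formulation (segment-mean landmarks and a pseudoinverse of the small landmark block) and is an assumption about the overall technique, not a reproduction of the NSTformer implementation in the linked repository. The names `full_attention` and `nystrom_attention` and the parameter `num_landmarks` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(q, k, v):
    """Standard softmax attention; builds an (L, L) score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v


def nystrom_attention(q, k, v, num_landmarks):
    """Landmark-based approximation of softmax attention.

    q, k, v have shape (L, d), where L = N * T is the number of
    ST-Tokens. Landmarks are means over contiguous token segments
    (an assumed choice), so L must be divisible by num_landmarks.
    """
    seq_len, dim = q.shape
    assert seq_len % num_landmarks == 0
    seg = seq_len // num_landmarks
    q_land = q.reshape(num_landmarks, seg, dim).mean(axis=1)
    k_land = k.reshape(num_landmarks, seg, dim).mean(axis=1)

    scale = 1.0 / np.sqrt(dim)
    f = softmax(q @ k_land.T * scale)       # (L, m)
    a = softmax(q_land @ k_land.T * scale)  # (m, m)
    b = softmax(q_land @ k.T * scale)       # (m, L)

    # Multiply right to left so no (L, L) matrix is ever formed.
    return f @ (np.linalg.pinv(a) @ (b @ v))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(48, 8)) for _ in range(3))
    approx = nystrom_attention(q, k, v, num_landmarks=8)
    exact = full_attention(q, k, v)
    print(approx.shape, float(np.abs(approx - exact).max()))
```

Only matrices of size L x m, m x m, and m x L are materialized, so for a fixed number of landmarks m the cost scales linearly with the number of tokens. When m equals L, the landmarks coincide with the tokens themselves and the approximation reduces to full attention.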

Paper Structure

This paper contains 20 sections, 7 equations, 1 figure, 3 tables, 2 algorithms.

Figures (1)

  • Figure 1: The Architecture of STformer and NSTformer