Spatially Constrained Transformer with Efficient Global Relation Modelling for Spatio-Temporal Prediction

Ashutosh Sao, Simon Gottschalk

TL;DR

ST-SampleNet is a novel transformer-based architecture that combines CNNs with self-attention to capture both local and global relations. It also introduces a lightweight region sampling strategy that prunes non-essential regions to make the approach more efficient.

Abstract

Accurate spatio-temporal prediction is crucial for the sustainable development of smart cities. However, current approaches often struggle to capture important spatio-temporal relationships and, in particular, overlook global relations among distant city regions. Most existing techniques rely predominantly on Convolutional Neural Networks (CNNs) to capture global relations, but CNNs exhibit a neighbourhood bias that makes them insufficient for capturing distant relations. To address this limitation, we propose ST-SampleNet, a novel transformer-based architecture that combines CNNs with self-attention mechanisms to capture both local and global relations effectively. Moreover, the quadratic complexity of self-attention becomes a challenge as the number of regions increases. To tackle this issue, we introduce a lightweight region sampling strategy that prunes non-essential regions and enhances the efficiency of our approach. Furthermore, we introduce a spatially constrained position embedding that incorporates spatial neighbourhood information into the self-attention mechanism, aiding semantic interpretation and improving the performance of ST-SampleNet. Our experimental evaluation on three real-world datasets demonstrates the effectiveness of ST-SampleNet. Additionally, our efficient variant reduces computational costs by 40% with only a marginal loss in performance of approximately 1%.
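The interplay between region sampling and the quadratic cost of self-attention can be illustrated with a minimal sketch. The abstract does not say how regions are scored or selected, so the per-region importance scores, the top-k selection rule, and the names `sample_regions` and `attention_pairs` below are assumptions made purely for illustration; they are not the authors' method.

```python
import math


def sample_regions(scores, keep_ratio):
    """Return the indices of the highest-scoring regions, in original order.

    `scores` is a hypothetical per-region importance score; how such a
    score would be obtained is not described in the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank region indices by descending score and keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def attention_pairs(n_regions):
    """Number of query-key pairs in full self-attention over n_regions tokens."""
    return n_regions * n_regions


scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6]
kept = sample_regions(scores, keep_ratio=0.5)
print(kept, attention_pairs(len(scores)), attention_pairs(len(kept)))
```

Halving the number of regions quarters the number of query-key pairs in this sketch. The sketch counts only the pairwise attention term, so it says nothing about the overall 40% cost reduction reported for the full model, which also includes other components.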

Paper Structure

This paper contains 33 sections, 15 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Hannover traffic between 8 and 9 am on a working day and the corresponding attention map from ST-SampleNet. The sparse attention shows that the prediction depends on only a few regions.
  • Figure 2: The ST-SampleNet architecture has three main components: (i) the Spatial Encoder, which learns the spatial dependency among different regions, (ii) the Temporal Encoder, which learns the temporal dependency among input time intervals, and (iii) the Predictor, which makes the final predictions.
  • Figure 3: The city is divided into multiple levels of granularity ($\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{H}_3$). Embeddings of the different levels are concatenated to generate the position embedding (SCPE) of a region $r^n$. Map data: © OpenStreetMap contributors, ODbL.
  • Figure 4: Effect of sampling on density prediction and the corresponding computational cost for the city of Hannover. The x-axis shows the ratio of regions kept after sampling, the left y-axis shows the RMSE for density prediction, and the right y-axis shows the corresponding GFLOPS.
  • Figure 5: Visualisation of learned position embeddings in Hannover.
  • ...and 1 more figure
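The concatenation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the city is assumed to be a regular grid, each hierarchy level is assumed to partition that grid into square cells of a given size, the embedding tables are filled with random values instead of learned ones, and the names `build_tables` and `position_embedding` as well as the cell sizes and embedding dimension are hypothetical. Which of $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{H}_3$ is the coarsest level is not stated in the caption, so the ordering here is arbitrary.

```python
import random


def build_tables(grid_h, grid_w, cell_sizes, dim, seed=0):
    """Create one embedding table per hierarchy level (random stand-ins
    for learned embeddings). Each table maps a cell id to a vector."""
    rng = random.Random(seed)
    tables = []
    for size in cell_sizes:
        n_rows = -(-grid_h // size)  # ceiling division
        n_cols = -(-grid_w // size)
        tables.append({
            (r, c): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for r in range(n_rows)
            for c in range(n_cols)
        })
    return tables


def position_embedding(row, col, cell_sizes, tables):
    """Concatenate the embeddings of the cells containing region (row, col)
    at every hierarchy level."""
    embedding = []
    for size, table in zip(cell_sizes, tables):
        embedding.extend(table[(row // size, col // size)])
    return embedding


cell_sizes = (4, 2, 1)  # assumed cell sizes, coarse to fine
tables = build_tables(grid_h=8, grid_w=8, cell_sizes=cell_sizes, dim=2)
a = position_embedding(0, 0, cell_sizes, tables)
b = position_embedding(0, 1, cell_sizes, tables)
print(len(a), a[:4] == b[:4])
```

In this sketch, neighbouring regions that fall into the same coarse cells share the corresponding parts of their position embedding, which is one simple way to inject spatial neighbourhood information into the token representation.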