Spatiotemporal Tile-based Attention-guided LSTMs for Traffic Video Prediction

Tu Nguyen

TL;DR

This paper tackles spatiotemporal traffic video prediction by jointly modeling high-resolution spatial structure and long-range temporal dynamics. It proposes a tile-aware cascaded Conv-LSTM with cross-frame additive attention and memory-flexible training, enabling tile-local dynamics and scalable memory management. Theoretical results establish a tight Lipschitz bound for the additive attention, $\|c(\bar{d}) - c(\bar{d}')\|_2 \le \tfrac{1}{2} H L_v L_W \sqrt{T_{\text{time}}} \, \|\bar{d} - \bar{d}'\|_2$, and a tiling approximation bound of $\tfrac{1}{2} L_{\text{sp}} \Delta_{ij}$ per tile, clarifying stability and spatial tradeoffs. Empirically, the method scales to large traffic maps and delivers competitive forecasting performance across multiple cities, validating memory-flexible tiling as a practical design for real-world traffic systems, with code released for reproducibility.

Abstract

This extended abstract describes our solution for the Traffic4Cast Challenge 2019. The task requires modeling both fine-grained (pixel-level) and coarse (region-level) spatial structure while preserving temporal relationships across long sequences. Building on Conv-LSTM ideas, we introduce a tile-aware, cascaded-memory Conv-LSTM augmented with cross-frame additive attention and a memory-flexible training scheme: frames are sampled per spatial tile so the model learns tile-local dynamics and per-tile memory cells can be updated sparsely, paged, or compressed to scale to large maps. We provide a compact theoretical analysis (tight softmax/attention Lipschitz bound and a tiling error lower bound) explaining stability and the memory-accuracy tradeoffs, and empirically demonstrate improved scalability and competitive forecasting performance on large-scale traffic heatmaps.
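The per-tile sampling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(T, H, W, C)` frame layout, the non-overlapping tiling, and the uniform choice of tile are all assumptions made for illustration.

```python
import numpy as np


def split_into_tiles(frames, tile_h, tile_w):
    """Split frames of shape (T, H, W, C) into non-overlapping spatial tiles.

    Returns an array of shape (n_rows, n_cols, T, tile_h, tile_w, C).
    Assumes H and W are exact multiples of the tile size; pad beforehand otherwise.
    """
    T, H, W, C = frames.shape
    if H % tile_h or W % tile_w:
        raise ValueError("frame size must be a multiple of the tile size")
    n_rows, n_cols = H // tile_h, W // tile_w
    # Factor the H axis into (row, within-tile row) and W into (col, within-tile col).
    tiles = frames.reshape(T, n_rows, tile_h, n_cols, tile_w, C)
    # Bring the tile indices to the front so tiles[i, j] is one tile's full sequence.
    return tiles.transpose(1, 3, 0, 2, 4, 5)


def sample_tile_sequence(frames, tile_h, tile_w, rng):
    """Pick one tile uniformly at random and return its index and frame sequence."""
    tiles = split_into_tiles(frames, tile_h, tile_w)
    i = int(rng.integers(tiles.shape[0]))
    j = int(rng.integers(tiles.shape[1]))
    return (i, j), tiles[i, j]


# Hypothetical toy input: 12 frames of a 64x48 map with 3 channels.
rng = np.random.default_rng(0)
frames = rng.random((12, 64, 48, 3))
(i, j), seq = sample_tile_sequence(frames, tile_h=16, tile_w=16, rng=rng)
print((i, j), seq.shape)
```

Each sampled sequence covers a single tile over all time steps, so a model trained on such samples only ever sees tile-local dynamics. How per-tile memory cells are paged or compressed is not specified on this page and is left out of the sketch.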

Paper Structure

This paper contains 22 sections, 6 theorems, 26 equations, 2 figures, 1 table.

Key Result

Theorem 3.1

Assume that $\|h_t\|_2 \le H$ for all $t$, and that the linear operators satisfy $\|W_\alpha\|_2 \le L_W$ and $\|v_\alpha\|_2 \le L_v$. Then, for any two query vectors $\bar{d}, \bar{d}' \in \mathbb{R}^{d'}$, the corresponding context vectors satisfy
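The concluding inequality of the theorem is missing from this page. The TL;DR gives the attention bound as $\|c(\bar{d}) - c(\bar{d}')\|_2 \le \tfrac{1}{2} H L_v L_W \sqrt{T_{\text{time}}} \, \|\bar{d} - \bar{d}'\|_2$, but the page does not confirm that Theorem 3.1 uses exactly this constant. The sketch below compares that bound with empirical ratios on random inputs. The score function $e_t = v_\alpha^\top \tanh(W_\alpha [h_t; \bar{d}])$ is an assumed Bahdanau-style form, and all dimensions and function names are hypothetical; only the symbols $h_t$, $W_\alpha$, $v_\alpha$, $H$, $L_W$, $L_v$ come from the source.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def additive_attention(h, d_bar, W_alpha, v_alpha):
    """Context vector c(d_bar) = sum_t a_t h_t with additive (tanh) scores.

    h: (T, n) hidden states, d_bar: (m,) query,
    W_alpha: (k, n + m), v_alpha: (k,).  Assumed score form, see lead-in.
    """
    T = h.shape[0]
    q = np.broadcast_to(d_bar, (T, d_bar.shape[0]))
    feats = np.tanh(np.concatenate([h, q], axis=1) @ W_alpha.T)  # (T, k)
    weights = softmax(feats @ v_alpha)                           # (T,)
    return weights @ h                                           # (n,)


def lipschitz_bound(H, L_v, L_W, T):
    """Bound quoted in the TL;DR: 0.5 * H * L_v * L_W * sqrt(T)."""
    return 0.5 * H * L_v * L_W * np.sqrt(T)


def empirical_ratio(h, W_alpha, v_alpha, m, rng, trials=200):
    """Largest observed ||c(d) - c(d')|| / ||d - d'|| over random query pairs."""
    worst = 0.0
    for _ in range(trials):
        d1 = rng.normal(size=m)
        d2 = d1 + 0.1 * rng.normal(size=m)
        diff = (additive_attention(h, d1, W_alpha, v_alpha)
                - additive_attention(h, d2, W_alpha, v_alpha))
        worst = max(worst, np.linalg.norm(diff) / np.linalg.norm(d1 - d2))
    return worst


rng = np.random.default_rng(0)
T, n, m, k = 12, 8, 8, 16                      # hypothetical sizes
h = rng.normal(size=(T, n))
h /= np.maximum(1.0, np.linalg.norm(h, axis=1, keepdims=True))  # ||h_t||_2 <= 1
W_alpha = rng.normal(size=(k, n + m)) / np.sqrt(n + m)
v_alpha = rng.normal(size=k) / np.sqrt(k)

H = np.linalg.norm(h, axis=1).max()            # bound on hidden-state norms
L_W = np.linalg.norm(W_alpha, 2)               # spectral norm of W_alpha
L_v = np.linalg.norm(v_alpha)                  # Euclidean norm of v_alpha
bound = lipschitz_bound(H, L_v, L_W, T)
ratio = empirical_ratio(h, W_alpha, v_alpha, m, rng)
print(f"empirical ratio {ratio:.4f} vs bound {bound:.4f}")
```

A random search like this can only fail to find a violation; it does not show the bound is tight, and a different score parameterization may give a different constant.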

Figures (2)

  • Figure 1: Encoder-decoder architecture
  • Figure 2: Prediction examples for each channel. Diff denotes the difference between the model prediction and the 12-frame average.

Theorems & Definitions (10)

  • Theorem 3.1: Lipschitz continuity of additive attention
  • proof
  • Theorem 3.2: Tiling approximation bound
  • proof
  • Lemma A.1: Tight spectral norm of softmax Jacobian
  • proof
  • Theorem A.2: Tight Lipschitz continuity of additive attention
  • proof
  • Corollary A.3: Softmax attenuates score gradients
  • Theorem A.4: Tiling approximation bound