Spatiotemporal Tile-based Attention-guided LSTMs for Traffic Video Prediction
Tu Nguyen
TL;DR
This paper tackles spatiotemporal traffic video prediction by jointly modeling high-resolution spatial structure and long-range temporal dynamics. It proposes a tile-aware cascaded Conv-LSTM with cross-frame additive attention and memory-flexible training, enabling tile-local dynamics and scalable memory management. Theoretical results establish a tight Lipschitz bound for the additive attention, $\|c(\bar{d}) - c(\bar{d}')\|_2 \le \tfrac{1}{2} H L_v L_W \sqrt{T_\text{time}} \, \|\bar{d} - \bar{d}'\|_2$, and a tiling approximation bound of $\tfrac{1}{2} L_{\text{sp}} \Delta_{ij}$ per tile, clarifying stability and spatial tradeoffs. Empirically, the method scales to large traffic maps and delivers competitive forecasting performance across multiple cities, validating memory-flexible tiling as a practical design for real-world traffic systems, with code released for reproducibility.
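As a rough illustration of the cross-frame additive attention mentioned above, the following NumPy sketch computes a Bahdanau-style context vector over a set of frame features. The function names, the parameter names (`W_q`, `W_k`, `v`), and the shapes are assumptions made for this example and are not taken from the released code. The helper `softmax_jacobian_norm` numerically inspects the spectral norm of the softmax Jacobian, which is known to be at most 1/2; this is plausibly where the factor $\tfrac{1}{2}$ in the stated bound comes from, though that link is an assumption here.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def additive_attention(query, keys, values, W_q, W_k, v):
    """Additive (Bahdanau-style) attention over T frames.

    Hypothetical shapes, chosen for illustration only:
      query: (d_q,), keys: (T, d_k), values: (T, d_v)
      W_q: (d_a, d_q), W_k: (d_a, d_k), v: (d_a,)
    """
    # One scalar score per frame: v^T tanh(W_k k_t + W_q q).
    scores = np.tanh(keys @ W_k.T + W_q @ query) @ v  # (T,)
    weights = softmax(scores)                         # (T,), sums to 1
    context = weights @ values                        # (d_v,)
    return context, weights


def softmax_jacobian_norm(z):
    """Spectral norm of the softmax Jacobian diag(p) - p p^T at z."""
    p = softmax(z)
    jac = np.diag(p) - np.outer(p, p)
    return np.linalg.norm(jac, 2)
```

Evaluating `softmax_jacobian_norm` at random score vectors never exceeds 0.5, and the value 0.5 is attained when two scores tie and dominate all others, for example at `np.array([0.0, 0.0])`.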
Abstract
This extended abstract describes our solution for the Traffic4Cast Challenge 2019. The task requires modeling both fine-grained (pixel-level) and coarse (region-level) spatial structure while preserving temporal relationships across long sequences. Building on Conv-LSTM ideas, we introduce a tile-aware, cascaded-memory Conv-LSTM augmented with cross-frame additive attention and a memory-flexible training scheme: frames are sampled per spatial tile, so the model learns tile-local dynamics, and per-tile memory cells can be updated sparsely, paged, or compressed to scale to large maps. We provide a compact theoretical analysis (a tight Lipschitz bound for the softmax-based attention and a per-tile approximation error bound) that explains stability and the memory-accuracy tradeoffs, and we empirically demonstrate improved scalability and competitive forecasting performance on large-scale traffic heatmaps.
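The per-tile frame sampling described above can be sketched as follows. This is a minimal NumPy illustration, assuming frames are stored as a `(T, H, W, C)` array and that the map divides evenly into tiles; the function names `split_into_tiles` and `sample_tile_sequence`, the tile sizes, and the uniform sampling strategy are hypothetical and not taken from the released code.

```python
import numpy as np


def split_into_tiles(frames, tile_h, tile_w):
    """Split frames (T, H, W, C) into a grid of spatial tiles.

    Returns an array of shape (n_rows, n_cols, T, tile_h, tile_w, C).
    Assumes H and W are exact multiples of the tile size.
    """
    T, H, W, C = frames.shape
    if H % tile_h or W % tile_w:
        raise ValueError("frame size must be divisible by the tile size")
    n_rows, n_cols = H // tile_h, W // tile_w
    tiles = frames.reshape(T, n_rows, tile_h, n_cols, tile_w, C)
    # Move the tile grid indices to the front, keep time next.
    return tiles.transpose(1, 3, 0, 2, 4, 5)


def sample_tile_sequence(frames, tile_h, tile_w, seq_len, rng):
    """Draw one training clip: a random tile and a random time window."""
    tiles = split_into_tiles(frames, tile_h, tile_w)
    n_rows, n_cols, T = tiles.shape[:3]
    i = int(rng.integers(n_rows))
    j = int(rng.integers(n_cols))
    t0 = int(rng.integers(T - seq_len + 1))
    # Clip has shape (seq_len, tile_h, tile_w, C).
    return (i, j, t0), tiles[i, j, t0:t0 + seq_len]
```

Each sampled clip covers only one tile, so a model trained on such clips sees tile-local dynamics, and any per-tile state can be indexed by the returned `(i, j)` pair.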
