Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink

Victoria Hankemeier, Malte Hankemeier

TL;DR

This work analyzes how temporal attention (TA) in spatio-temporal graphs can degrade information flow through a diagonal attention sink that intensifies with longer sequences. By deriving the Jacobian of temporal attention, the authors decompose the TA dynamics into value, key, and query pathways and establish bounds showing that off-diagonal influence decays with sequence length while diagonal updates remain strong. They identify a stochastic parroting effect and propose three regularizers (diagonal mask, diagonal dropout, and diagonal penalty) to rebalance information flow, complemented by residual connections. Empirical results on METR-LA show that the diagonal regularizers, especially the diagonal penalty, improve forecasting accuracy at longer horizons and yield interpretable attention patterns, offering a practical way to stabilize temporal information flow in spatio-temporal GNNs.

Abstract

Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration across space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias toward the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer from a diagonal attention sink. We suggest regularization methods and experimentally demonstrate their effectiveness.
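
To make the three regularizers named in the TL;DR concrete, here is a minimal NumPy sketch of one plausible reading of each. This section does not give the authors' exact formulations, so everything below is an assumption made for illustration: the function names `temporal_attention` and `diagonal_penalty`, the parameters `mode`, `drop_p`, and `weight`, the choice to apply the mask and the dropout to the pre-softmax scores, and the use of the mean diagonal weight as the penalty.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive exactly zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(q, k, mode="none", drop_p=0.5, rng=None):
    """Attention weights over T time steps for one node (hypothetical helper).

    q, k: arrays of shape (T, d) with T >= 2. Returns a (T, T) matrix whose
    rows sum to 1. `mode` selects an assumed diagonal regularizer.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    eye = np.eye(T, dtype=bool)
    if mode == "mask":
        # Assumed "diagonal mask": a time step may never attend to itself.
        scores = np.where(eye, -np.inf, scores)
    elif mode == "dropout":
        # Assumed "diagonal dropout": each row loses its self-attention
        # entry independently with probability drop_p (training time only).
        if rng is None:
            rng = np.random.default_rng()
        drop = eye & (rng.random(T) < drop_p)[:, None]
        scores = np.where(drop, -np.inf, scores)
    return softmax(scores)


def diagonal_penalty(attn, weight=0.1):
    # Assumed "diagonal penalty": a loss term proportional to the mean
    # attention mass on the diagonal, to be added to the forecasting loss.
    return weight * np.trace(attn) / attn.shape[0]
```

In this sketch the mask and the dropout act on the scores before the softmax, so the remaining off-diagonal weights are renormalized automatically, while the penalty leaves the forward pass unchanged and only discourages large diagonal weights through the loss.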

Paper Structure (15 sections, 18 equations, 2 figures, 2 tables)