Attention (as Discrete-Time Markov) Chains

Yotam Erel; Olaf Dünkel; Rishabh Dabral; Vladislav Golyanik; Christian Theobalt; Amit H. Bermano

TL;DR

This work reframes Transformer attention as a discrete-time Markov chain (DTMC) by treating the post-softmax attention matrix $\mathbf{A}$ as a transition operator. Because every row of $\mathbf{A}$ sums to one, $\mathbf{A}$ is right-stochastic, and its entries can be read as transition probabilities between tokens. Propagating attention through multiple steps (multi-bounce) reveals metastable states in which semantically related tokens cluster. TokenRank is defined as the stationary distribution $\mathbf{v}_{ss}$ of the chain, which satisfies $\mathbf{A}^T \mathbf{v}_{ss} = \mathbf{v}_{ss}$ and measures global token importance. The authors also introduce $\lambda_2$-weighted head averaging, where $\lambda_2$ is the second-largest eigenvalue of a head's attention matrix and governs how quickly its chain mixes, so that faster- or slower-mixing heads can be emphasized. Together with PageRank-style refinements, these tools yield practical gains in zero-shot image segmentation, attention visualization, image generation, and diffusion-based segmentation methods. The DTMC perspective thus offers a principled, lightweight framework for analyzing and exploiting attention in visual transformers, supporting both interpretability and downstream performance.
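The stationary-distribution definition of TokenRank can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `softmax_rows` and `token_rank`, the random logits, and the use of plain power iteration (with no PageRank-style refinement) are all hypothetical choices made for the example.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax, so the result is right-stochastic (rows sum to 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def token_rank(A, num_iters=1000, tol=1e-10):
    """Approximate v_ss with A^T v_ss = v_ss by power iteration.

    Hypothetical helper: starts from the uniform distribution and
    repeatedly applies A^T until the L1 change drops below `tol`.
    """
    n = A.shape[0]
    v = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        v_next = A.T @ v
        converged = np.abs(v_next - v).sum() < tol
        v = v_next
        if converged:
            break
    return v / v.sum()

# Toy attention matrix built from random logits (an assumption for illustration).
rng = np.random.default_rng(0)
A = softmax_rows(rng.normal(size=(6, 6)))
v_ss = token_rank(A)

print("rows sum to one:", np.allclose(A.sum(axis=1), 1.0))
print("fixed point holds:", np.allclose(A.T @ v_ss, v_ss, atol=1e-8))
```

Because a softmax output is strictly positive, the toy chain is irreducible and aperiodic, so power iteration converges to a unique stationary distribution; real attention matrices with masking or exact zeros may need extra care.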

Abstract

We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores, such as selection, summation, and averaging, in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our key observation is that tokens linked to semantically similar regions form metastable states, i.e., regions where attention tends to concentrate, while noisy attention scores dissipate. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank, the steady-state vector of the Markov chain, which measures global token importance. We show that TokenRank enhances unconditional image generation, improving both quality (IS) and diversity (FID), and can also be incorporated into existing segmentation techniques to improve their performance on existing benchmarks. We believe our framework offers a fresh view of how tokens are attended to in modern visual transformers.
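The two lightweight tools mentioned in the abstract, matrix multiplication for indirect attention and eigenanalysis for metastability, can be sketched as follows. This is a toy illustration under assumptions, not the paper's code: the function names `multi_bounce` and `second_eigenvalue_magnitude`, the 4-token matrix, and the propagation direction (a one-hot row vector multiplied by $\mathbf{A}$ from the left) are hypothetical choices for the example.

```python
import numpy as np

def multi_bounce(A, start, n_bounces):
    """Propagate a one-hot state on token `start` through the chain.

    After one bounce the state equals row `start` of A (direct attention);
    further bounces add indirect attention.
    """
    state = np.zeros(A.shape[0])
    state[start] = 1.0
    for _ in range(n_bounces):
        state = state @ A  # row vector times right-stochastic matrix
    return state

def second_eigenvalue_magnitude(A):
    """Magnitude of the second-largest eigenvalue; values near 1 mean slow mixing."""
    magnitudes = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return magnitudes[1]

# Toy chain with two clusters {0, 1} and {2, 3} and a weak leak between them.
A = np.array([
    [0.49, 0.49, 0.01, 0.01],
    [0.49, 0.49, 0.01, 0.01],
    [0.01, 0.01, 0.49, 0.49],
    [0.01, 0.01, 0.49, 0.49],
])

few = multi_bounce(A, start=0, n_bounces=2)     # mass still mostly in cluster {0, 1}
many = multi_bounce(A, start=0, n_bounces=200)  # close to the steady state
lam2 = second_eigenvalue_magnitude(A)

print("mass in own cluster after 2 bounces:", few[:2].sum())
print("state after 200 bounces:", many)
print("|lambda_2|:", lam2)
```

In this toy chain the state lingers inside its own cluster for many bounces before spreading out, which is the metastable behaviour the abstract describes, and the eigenvalue magnitude close to one reflects that slow mixing.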

