Attention (as Discrete-Time Markov) Chains

Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Amit H. Bermano

TL;DR

This work reframes Transformer attention as a discrete-time Markov chain (DTMC) by treating the post-softmax attention matrix $\mathbf{A}$ as a transition matrix. Propagating attention through multiple steps (multi-bounce) reveals metastable states in which semantically related tokens cluster, and the stationary distribution of the chain, called TokenRank, measures global token importance. The authors also introduce $\lambda_2$-weighted head averaging, where $\lambda_2$ is the second-largest eigenvalue of a head's attention matrix and indicates how quickly that head's chain mixes, so heads can be weighted toward faster or slower mixing. They report gains in zero-shot image segmentation, improved attention visualization, and enhancements to generation and diffusion-based segmentation methods. The DTMC view gives a principled framework for analyzing and using attention in visual transformers. Mathematically, the approach rests on $\mathbf{A}$ being right-stochastic (each row sums to 1), on the stationary vector satisfying $\mathbf{A}^T \mathbf{v}_{ss} = \mathbf{v}_{ss}$, and on PageRank-style refinements.
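As a rough illustration of the stationary-distribution idea, the sketch below runs plain power iteration on a tiny post-softmax matrix. The helper names (`softmax_rows`, `token_rank`), the 3x3 score values, and the stopping tolerance are all assumptions made for this example; it omits the PageRank-style refinements mentioned above and is not the authors' implementation.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax, so every row becomes a probability distribution."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def token_rank(A, tol=1e-12, max_iters=10_000):
    """Power iteration for v^T A = v^T, starting from the uniform distribution."""
    n = len(A)
    v = [1.0 / n] * n
    for _ in range(max_iters):
        nxt = [sum(v[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v

# Hypothetical pre-softmax attention scores for three tokens (illustrative only).
scores = [[2.0, 0.5, 0.1],
          [1.5, 1.0, 0.2],
          [1.8, 0.3, 0.9]]
A = softmax_rows(scores)   # right-stochastic: each row sums to 1
v_ss = token_rank(A)       # approximate stationary vector
```

Because softmax outputs are strictly positive (absent masking), the chain is irreducible and aperiodic, so the iteration converges to a unique stationary vector. In this toy matrix every token sends most of its attention to token 0, so `v_ss` places the most mass there.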

Abstract

We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our key observation is that tokens linked to semantically similar regions form metastable states, i.e., regions where attention tends to concentrate, while noisy attention scores dissipate. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank -- the steady state vector of the Markov chain, which measures global token importance. We show that TokenRank enhances unconditional image generation, improving both quality (IS) and diversity (FID), and can also be incorporated into existing segmentation techniques to improve their performance over existing benchmarks. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
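The remark that metastable states and their prevalence follow from matrix multiplication and eigenanalysis can be read in standard Markov-chain terms. The relations below are a sketch in that spirit, not equations quoted from the paper; the symbols $\mathbf{v}_0$, $\mathbf{v}_n$, and the constant $C$ are notation assumed here, and the bound is stated for a diagonalizable $\mathbf{A}$ with strictly positive entries.

$$
\mathbf{v}_n^T = \mathbf{v}_0^T \mathbf{A}^n, \qquad
1 = \lambda_1 > |\lambda_2| \ge |\lambda_3| \ge \dots, \qquad
\left\| \mathbf{v}_n^T - \mathbf{v}_{ss}^T \right\| \le C \, |\lambda_2|^n .
$$

The first relation is the $n$th bounce: repeated multiplication by $\mathbf{A}$. The last one says the distance to the steady state shrinks at a rate set by $|\lambda_2|$. When $|\lambda_2|$ is close to 1, the chain mixes slowly and probability mass lingers in groups of tokens for many bounces, which is the behavior associated with pronounced metastable states; when $|\lambda_2|$ is small, the chain forgets its starting token after a few bounces.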

Paper Structure

This paper contains 44 sections, 5 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Attention Chains interprets attention matrices as Markov chains. The first-order bounce ($n=1$) corresponds to a common row-select operation from an attention matrix (top: the token "cat", bottom: "tie"). Iteratively computing the $n$th-order attention bounce models higher-order attention effects, eventually yielding a stationary vector ($n\rightarrow\infty$) that globally captures the flow of attention into each token (TokenRank). Intermediate iterations result in sharper segmentation maps.
  • Figure 2: Illustration of higher-order effects. Left: Attention matrix $A$ with sequence length 5. Middle: A DTMC with transition probabilities defined by matrix $A$, where only strong connections are shown. Right (One-Hot): To evaluate what state-4 attends to, we can iterate using the power method once starting from a one-hot vector ($n=0$), which results in the row-select operation ($n=1$). However, this first-order approximation is insufficient since state-0 mostly transitions to state-3 and, therefore, state-4 indirectly attends to state-3. This becomes evident as we iterate further ($n=2$). Right (Uniform): To compute a global token ranking, we can iterate starting from a uniform state ($n=0$), resulting in a per-column sum operation ($n=1$). This indicates state-0 as most important because many states have a high probability of transitioning into state-0. However, state-0 maps to state-3 with high probability, and state-3 maps to state-4 with high probability. Therefore, the importance of state-4 should be elevated. When considering the second bounce ($n=2$), more probability mass is directed into state-3, and with a sufficient number of iterations the steady state ($v^T_{ss}$) ranks state-4 as the most important state globally, which aligns with the intuition above.
  • Figure 3: ImageNet segmentation. Considering higher-order attention effects improves results. We visualize the raw attention output (colored) and the binary segmentation masks. We present more qualitative comparisons in the section on qualitative segmentation results.
  • Figure 4: Global incoming attention. Visualizations are computed after averaging over heads for four different layers of DINOv1. While the center token only attends to the local neighborhood for earlier layers, column sum results in noisy attention visualizations. In contrast, TokenRank captures global incoming attention on par with the CLS token that was explicitly trained to capture global attention for DINOv1. We show per-head visualizations in the section on per-head visualizations.
  • Figure 5: Qualitative results for SAG. Images generated using TokenRank have fewer artifacts and are more structured.
  • ...and 12 more figures
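The walk-through in the Figure 2 caption can be reproduced with a few lines of code. The sketch below uses a hypothetical 5x5 right-stochastic matrix chosen to mimic the described connections (state-4 to state-0, state-0 to state-3, state-3 to state-4); the values are not taken from the paper, and the helper name `bounce` is an assumption of this example.

```python
def bounce(v, A, n):
    """Apply n steps of the chain to the row vector v, i.e. compute v^T A^n."""
    size = len(A)
    for _ in range(n):
        v = [sum(v[i] * A[i][j] for i in range(size)) for j in range(size)]
    return v

# Hypothetical transition matrix; each row sums to 1 (illustrative values only).
A = [
    [0.05, 0.05, 0.05, 0.80, 0.05],  # state-0 mostly transitions to state-3
    [0.60, 0.10, 0.10, 0.10, 0.10],  # state-1 mostly transitions to state-0
    [0.60, 0.10, 0.10, 0.10, 0.10],  # state-2 mostly transitions to state-0
    [0.05, 0.05, 0.05, 0.05, 0.80],  # state-3 mostly transitions to state-4
    [0.70, 0.05, 0.05, 0.10, 0.10],  # state-4 mostly transitions to state-0
]
N = len(A)

one_hot = [0.0] * N
one_hot[4] = 1.0               # start at state-4
uniform = [1.0 / N] * N        # start from the uniform state

first = bounce(one_hot, A, 1)    # equals row 4 of A (row-select)
second = bounce(one_hot, A, 2)   # indirect attention after two bounces
col_avg = bounce(uniform, A, 1)  # column sums of A, scaled by 1/N
```

With this matrix, the first bounce from state-4 peaks at state-0, while the second bounce peaks at state-3, which is the indirect attention the caption describes. Starting from the uniform state, one bounce gives the per-column sum up to the constant factor $1/N$ and ranks state-0 highest.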