Infinite Self-Attention

Giorgio Roffo

TL;DR

This paper proposes Linear-InfSA, a linear-time variant of Infinite Self-Attention that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to the per-head dimension dh (independent of sequence length N) and is drop-in compatible with Vision Transformers.

Abstract

The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
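
The Neumann-series view in the abstract can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it builds a non-negative affinity matrix with ReLU, normalizes it to unit Frobenius norm, sets M = gamma * A_hat, and compares a truncated series I + M + M^2 + ... against the closed form (I - M)^{-1}. The toy sizes, the random Q and K, the value gamma = 0.9, and the use of column sums as an incoming centrality score are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                         # toy sizes, chosen only for illustration
Q = rng.standard_normal((N, d))     # hypothetical queries
K = rng.standard_normal((N, d))     # hypothetical keys

A = np.maximum(Q @ K.T, 0.0)        # non-negative affinities (ReLU)
A_hat = A / np.linalg.norm(A)       # Frobenius normalization: ||A_hat||_F = 1
gamma = 0.9                         # discount factor (assumed value)
M = gamma * A_hat                   # spectral radius <= gamma < 1

# Truncated Neumann series: S = I + M + M^2 + ... (multi-hop accumulation).
S = np.zeros((N, N))
P = np.eye(N)
for _ in range(200):
    S += P
    P = P @ M

# Closed form of the infinite series; for an absorbing Markov chain with
# transient block M this is the fundamental matrix (expected visit counts).
closed = np.linalg.inv(np.eye(N) - M)

# Column sums: total expected visits to each token over all start tokens,
# used here as an assumed Katz-style incoming centrality.
c_in = closed.T @ np.ones(N)

print("max |series - closed form|:", np.abs(S - closed).max())
print("incoming centrality:", np.round(c_in, 3))
```

The series converges because the spectral radius of A_hat is bounded by its Frobenius norm, so any gamma below 1 keeps the spectral radius of M below 1. Note that unit Frobenius norm alone does not guarantee row sums of M at most 1, so reading 1 minus the row sum as an absorption probability requires that condition to hold for the matrix at hand.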

Paper Structure

This paper contains 44 sections, 29 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Comparison of attention graphs. Visualization of ViT-L/16 attention maps on ImageNet. Softmax attention distributes focus across background regions, while InfSA variants produce sharper, object-aligned activations.
  • Figure 2: (a) InfSA in a Pre-LN ViT block. Two InfSA variants within standard Transformer scaffolding: (1) Pure InfSA uses full attention with ReLU and Frobenius normalization, accumulating discounted outputs across layers; (2) Linear InfSA computes soft token scores, pools values per head, and broadcasts context with per-layer scaling. Both are drop-in compatible with Transformer blocks. (b) Efficiency by complexity tier (4L, $\mathbf{1024^2}$). Inference throughput vs. energy per image for nine attention mechanisms, colored by asymptotic complexity. InfViT Linear ($\mathcal{O}(N)$, red star) achieves the highest throughput at the lowest energy cost.
  • Figure 3: Softmax attention (1-hop) vs. InfSA (Neumann series). Left: Frobenius-normalized $\hat{A}$ ($\|\hat{A}\|_F{=}1$); row sums vary, unlike softmax. Middle: Absorbing Markov chain $\mathbf{M}{=}\gamma\hat{A}$; dashed red arrows show absorption $R_i{=}1{-}\sum_j \mathbf{M}_{ij}$ into $\mathfrak{a}$. Right: Starting from the all-ones input $\mathbf{e}$ (Eq. \ref{eq:final_score}): softmax attention (row-stochastic, 1-hop) ranks token 0 first via column sums, since many tokens directly attend to it. InfSA iterates $\mathbf{M}$ further; at $n{=}2$ the chain $0{\to}3{\to}4$ redirects mass to token 4, and the Katz centrality $c^{\mathrm{in}}$ correctly identifies token 4 as globally most important, the multi-hop outcome that standard self-attention misses.
  • Figure 4: Efficiency dashboard (4L-64H, inference at $\mathbf{1024^2}$). Speed-up over Standard ViT, absolute throughput, and energy per image. InfViT Linear ($\mathcal{O}(N)$) achieves $13.4{\times}$ speed-up at 0.87 J/img.
  • Figure 5: 4L vs. 24L depth comparison (inference at $\mathbf{1024^2}$). Throughput (left) and energy (right) for the 4L-64H and 24L-16H configurations. InfViT Linear leads in both regimes; the gap narrows at 24L due to the higher fixed overhead of deeper networks, but the ranking is preserved.
  • ...and 13 more figures
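
The idea of approximating a principal eigenvector without forming the full N by N attention matrix can be illustrated with a matrix-free power iteration. The sketch below is not the Linear-InfSA algorithm from the paper; it assumes a factorized non-negative operator A = phi_q @ phi_k.T (the names phi_q, phi_k and the helper matrix_free_power_iteration are hypothetical) and shows that each iteration only needs an intermediate state of size d, independent of N. The dense matrix is built at the end solely to verify the result on a small example.

```python
import numpy as np

def matrix_free_power_iteration(phi_q, phi_k, iters=50):
    """Power iteration on A = phi_q @ phi_k.T without materializing A.

    Each step costs O(N * d) and keeps a state vector of size d.
    """
    n = phi_q.shape[0]
    v = np.full(n, 1.0 / np.sqrt(n))    # start from a uniform unit vector
    state = np.zeros(phi_k.shape[1])
    for _ in range(iters):
        state = phi_k.T @ v             # shape (d,): fixed-size state
        v = phi_q @ state               # shape (N,): equals A @ v
        v /= np.linalg.norm(v)
    return v, state

rng = np.random.default_rng(0)
N, d = 256, 16                          # toy sizes (assumed)
phi_q = np.maximum(rng.standard_normal((N, d)), 0.0)   # hypothetical features
phi_k = np.maximum(rng.standard_normal((N, d)), 0.0)

v_fast, state = matrix_free_power_iteration(phi_q, phi_k)

# Reference: dense eigendecomposition, feasible only because N is small here.
A = phi_q @ phi_k.T
w, V = np.linalg.eig(A)
v_ref = V[:, np.argmax(w.real)].real
cos = abs(v_fast @ v_ref) / (np.linalg.norm(v_fast) * np.linalg.norm(v_ref))
print("cosine similarity to dense principal eigenvector:", cos)
```

The cosine computed here measures convergence of an exact power iteration on a toy factorized operator, so it is not comparable to the 0.985 reported in the abstract, which compares the paper's linear approximation with its quadratic operator.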