Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Yuval Ran-Milo

Abstract

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.
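The task and the normalization argument described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's code: the function names, the boolean `is_trigger` mask, the convention that "preceding" means strictly earlier positions (including position 0), and the zero-valued BOS vector in the last function are all assumptions made for the example.

```python
import numpy as np


def trigger_conditional_target(x, is_trigger):
    """Target output of the task (hypothetical helper).

    x          : (n, d) array of token representations.
    is_trigger : (n,) boolean mask marking trigger positions.
    At a trigger position the target is the mean of all strictly
    preceding rows of x (an assumed reading of "preceding");
    everywhere else the target is the zero vector.
    """
    y = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        if is_trigger[i] and i > 0:
            y[i] = x[:i].mean(axis=0)
    return y


def softmax_weights(scores):
    """Softmax attention weights: non-negative and always summing to 1."""
    e = np.exp(scores - scores.max())
    return e / e.sum()


def relu_weights(scores):
    """Non-normalized ReLU attention weights: may be all zero."""
    return np.maximum(scores, 0.0)


def sink_output(values, sink_score):
    """Softmax output when position 0 gets score `sink_score`, others 0.

    Assumes the value vector at position 0 (the BOS slot) is zero, so
    mass placed there contributes nothing to the output.
    """
    scores = np.zeros(values.shape[0])
    scores[0] = sink_score
    return softmax_weights(scores) @ values


# Small demonstration with made-up numbers.
x = np.arange(12, dtype=float).reshape(4, 3)
target = trigger_conditional_target(x, np.array([False, False, False, True]))

values = x.copy()
values[0] = 0.0  # assumed zero-valued BOS
near_zero = sink_output(values, sink_score=20.0)        # mass on the sink
relu_zero = relu_weights(np.full(4, -5.0)) @ values     # exactly zero
```

Under these assumptions, ReLU weights can all vanish, so the "output zero" default needs no anchor. Softmax weights must sum to one, so the output approaches zero only as the mass moves onto a position whose value is zero, which is the sink behavior the abstract describes.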

Paper Structure (44 sections, 16 theorems, 79 equations, 8 figures)

Key Result

Theorem 1

For any $\varepsilon, \delta \in \mathbb{R}_{>0}$, $L \in \mathbb{N}_{\geq 4}$, $n \in \mathbb{N}_{\geq 5}$, and any bounded probability density function $\mathcal{P}$, there exists a constant $\eta \in \mathbb{R}_{>0}$ such that the following holds. Consider any single-layer softmax attention ...

Figures (8)

  • Figure 1: Reproduced from Barbero et al. (2025): an attention head that fires on an apostrophe trigger and otherwise attends to BOS.
  • Figure 2: Reproduced from Guo et al. (2024): an active-dormant attention head in Llama 2-7B. On code-like inputs (GitHub, top), the head exhibits diverse attention patterns; on text-like inputs (Wikipedia, bottom), it collapses to an attention sink on position 0.
  • Figure 3: Experimental validation: Theoretically analyzed model. (a) Mean attention weights for softmax attention across 1000 test examples with trigger at position 8. Dark regions indicate high attention mass concentrated on BOS (position 1) before the trigger. (b) Standard deviation of softmax attention weights shows negligible variance, confirming stable sink behavior. (c) Mean attention weights for ReLU attention show no sink formation; attention on BOS remains near zero. (d) Standard deviation for ReLU attention confirms consistent behavior across examples.
  • Figure 4: Multi-layer multi-head validation. Attention patterns for a 2-layer 2-head softmax model on a random input (with trigger at position 8). All heads exhibit strong sink behavior.
  • Figure 5: ReLU attention: 2-layer 2-head model. Attention patterns on a single test input (trigger at position 8). No sink formation occurs in any head; attention on BOS remains near zero throughout.
  • ...and 3 more figures

Theorems & Definitions (32)

  • Theorem 1: Single-Layer Attention Sink Necessity
  • proof: Proof sketch (full proof in \ref{proof:main})
  • Theorem 2: Multi-Layer Attention Sink Necessity
  • proof: Proof sketch (full proof in \ref{proof:multilayer})
  • Theorem 3: ReLU Attention Without Sinks
  • proof: Proof sketch (full proof in \ref{proof:relu})
  • Theorem 4: Pre-Trigger Necessity
  • proof
  • Theorem 5: Post-Trigger Necessity
  • proof
  • ...and 22 more