Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Yuval Ran-Milo

Abstract

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.
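The intuition in the abstract can be illustrated with a minimal sketch. It is not the paper's construction or proof; the function names, the choice that the average covers every earlier position, and the zero-valued first (BOS-like) token are all assumptions made for illustration. The sketch shows that softmax weights always sum to one, so a head can only output (near) zero by placing its mass on a token whose value is zero, whereas unnormalized ReLU weights can simply all be zero.

```python
import math

def target_output(tokens, trigger_pos):
    """Hypothetical task target: at the trigger position, return the mean of
    all earlier token vectors; at every other position, return zeros."""
    dim = len(tokens[0])
    outputs = []
    for i in range(len(tokens)):
        if i == trigger_pos and i > 0:
            prev = tokens[:i]
            outputs.append([sum(v[d] for v in prev) / len(prev) for d in range(dim)])
        else:
            outputs.append([0.0] * dim)
    return outputs

def softmax_weights(scores):
    """Numerically stable softmax; the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """Unnormalized ReLU attention weights; they may all be 0."""
    return [max(0.0, s) for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Assumed toy values: the first token plays the role of a zero-valued sink.
values = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]

# Softmax with uniform scores cannot produce the zero (no-op) output.
no_sink = attend(softmax_weights([0.0, 0.0, 0.0]), values)

# Softmax with mass concentrated on the first token gets close to zero.
with_sink = attend(softmax_weights([20.0, 0.0, 0.0]), values)

# ReLU attention reaches exactly zero without attending to any token.
relu_noop = attend(relu_weights([-5.0, -5.0, -5.0]), values)
```

In this toy setting, `with_sink` is close to zero only because nearly all softmax mass sits on the first token, while `relu_noop` is exactly zero with no mass placed anywhere.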

Paper Structure (44 sections, 16 theorems, 79 equations, 8 figures)

Introduction
Sinks Empirically Enable No-Op Behaviors in Real Models
Theory and Results
Notation and Setup
Task Definition
Input Distribution
Target Output
Loss Function
Task Motivation and Justification
Model Architecture
Softmax Attention.
ReLU Attention.
Multi-Layer Attention.
Main Result
Experiments
...and 29 more sections

Key Result

Theorem 1

For any $\varepsilon, \delta \in \mathbb{R}_{>0}$, $L \in \mathbb{N}_{\geq 4}$, $n \in \mathbb{N}_{\geq 5}$, and a bounded probability density function $\mathcal{P}$, there exists a constant $\eta \in \mathbb{R}_{>0}$ such that the following holds. Consider any single-layer softmax attention ...

Figures (8)

Figure 1: Reproduced from Barbero et al. (2025): an attention head that fires on an apostrophe trigger and otherwise attends to BOS.
Figure 2: Reproduced from Guo et al. (2024): an active-dormant attention head in Llama 2-7B. On code-like inputs (GitHub, top), the head exhibits diverse attention patterns; on text-like inputs (Wikipedia, bottom), it collapses to an attention sink on position 0.
Figure 3: Experimental validation: Theoretically analyzed model. (a) Mean attention weights for softmax attention across 1000 test examples with trigger at position 8. Dark regions indicate high attention mass concentrated on BOS (position 1) before the trigger. (b) Standard deviation of softmax attention weights shows negligible variance, confirming stable sink behavior. (c) Mean attention weights for ReLU attention show no sink formation—attention on BOS remains near zero. (d) Standard deviation for ReLU attention confirms consistent behavior across examples.
Figure 4: Multi-layer multi-head validation. Attention patterns for a 2-layer 2-head softmax model on a random input (with trigger at position 8). All heads exhibit strong sink behavior.
Figure 5: ReLU attention: 2-layer 2-head model. Attention patterns on a single test input (trigger at position 8). No sink formation occurs in any head; attention on BOS remains near zero throughout.
...and 3 more figures

Theorems & Definitions (32)

Theorem 1: Single-Layer Attention Sink Necessity
proof: Proof sketch (full proof in \ref{proof:main})
Theorem 2: Multi-Layer Attention Sink Necessity
proof: Proof sketch (full proof in \ref{proof:multilayer})
Theorem 3: ReLU Attention Without Sinks
proof: Proof sketch (full proof in \ref{proof:relu})
Theorem 4: Pre-Trigger Necessity
proof
Theorem 5: Post-Trigger Necessity
proof
...and 22 more
