The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

Shangwen Sun; Alfredo Canziani; Yann LeCun; Jiachen Zhu

TL;DR

The pre-norm configuration is identified as the key choice that enables the co-occurrence of massive activations and attention sinks, and it is shown that ablating it causes the two phenomena to decouple.

Abstract

We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
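The two phenomena defined above can be illustrated on synthetic data. The sketch below is not the authors' measurement procedure: the helper names `spike_ratio` and `sink_mass`, the injected spike value, and the choice of position 0 as the sink are all assumptions made for illustration.

```python
import numpy as np

def spike_ratio(hidden):
    """Largest absolute activation divided by the median absolute activation."""
    mags = np.abs(hidden)
    return float(mags.max() / np.median(mags))

def sink_mass(attn, sink_index=0):
    """Average attention mass that all queries place on one key position."""
    return float(attn[:, sink_index].mean())

rng = np.random.default_rng(0)
T, d = 8, 16  # illustrative sequence length and hidden width

# Synthetic hidden states with one hypothetical massive activation:
# a single token (row 0) carries an extreme value in a single channel (3).
hidden = rng.normal(size=(T, d))
hidden[0, 3] = 500.0

# Synthetic causal attention whose logits are biased toward position 0.
logits = rng.normal(size=(T, T))
logits[:, 0] += 8.0
causal = np.tril(np.ones((T, T), dtype=bool))
logits = np.where(causal, logits, -np.inf)
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

print("spike ratio:", spike_ratio(hidden))
print("mean attention mass on position 0:", sink_mass(attn))
```

On a real model, `hidden` and `attn` would instead be read from a layer's residual stream and from one head's attention matrix.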

Paper Structure (51 sections, 3 theorems, 40 equations, 8 figures, 10 tables)

Introduction
Preliminaries
Next-Token Prediction
Transformer Architecture
Token embedding.
Transformer layers.
Attention block.
Feed-forward block.
Prediction head.
From Spikes to Sinks
The Emergence of Massive Activations
The Life Cycle of Massive Activations
Step-up blocks.
Residual accumulation.
Step-down blocks.
...and 36 more sections

Key Result

Theorem 2.1

Let $\mathbf{O}^{(h)} \coloneqq \mathbf{A}^{(h)} \mathbf{V}^{(h)} \in \mathbb{R}^{T \times \mathrm{d}_{\mathrm{head}}}$ denote the output of head $h$, where $\mathbf{V}^{(h)} \coloneqq \tilde{\mathbf{H}} \mathbf{W}_{\mathrm{V}}^{(h)}$ and $\tilde{\mathbf{H}}$ denotes the hidden representations of the inputs, with $\mathbf{W}_{\mathrm{O}}^{(h)} \in \mathbb{R}^{\mathrm{d}_{\mathrm{head}} \times \cdots}$ ...
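The theorem statement is cut off here, but its title, "Attention output as a sum over heads", points to the standard identity that concatenating head outputs and applying the output projection equals summing the per-head products $\mathbf{O}^{(h)} \mathbf{W}_{\mathrm{O}}^{(h)}$. The sketch below checks that identity numerically. It assumes each $\mathbf{W}_{\mathrm{O}}^{(h)}$ has `d_model` columns, which the truncated text does not confirm, and it uses random stand-ins for the head outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_heads, d_head, d_model = 5, 4, 8, 32  # illustrative sizes (assumed)

# Random stand-ins for the per-head outputs O^(h) = A^(h) V^(h).
O = rng.normal(size=(n_heads, T, d_head))
# Per-head slices of the output projection; the d_model width is an assumption.
W_O = rng.normal(size=(n_heads, d_head, d_model))

# Route 1: concatenate the heads, then apply the stacked output projection.
concat = np.concatenate(list(O), axis=1)         # (T, n_heads * d_head)
W_stacked = np.concatenate(list(W_O), axis=0)    # (n_heads * d_head, d_model)
out_concat = concat @ W_stacked

# Route 2: sum the per-head contributions O^(h) W_O^(h).
out_sum = sum(O[h] @ W_O[h] for h in range(n_heads))

assert np.allclose(out_concat, out_sum)
```

The per-head form makes it possible to reason about how each head contributes to the attention output on its own.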

Figures (8)

Figure 1: Top-3 channel magnitudes across depth in Llama 2 7B and Qwen3 8B (post-residuals vs. block outputs). In both models, early blocks inject massive activations that persist through most of the network before being neutralized by late blocks.
Figure 2: Input-output characteristics of $\operatorname{SiLU}$ in step-up and step-down blocks of Llama 2 7B. Based on $1024$ randomly sampled sentences from the C4 dataset (Raffel et al., 2020), we plot the cosine similarity and norm ratio for each token. Points are colored by the maximum magnitude of the block output. For spike tokens (red points), both direction and norm remain largely unchanged, indicating that the $\operatorname{SiLU}$ gate operates in a near-identity regime.
Figure 3: Frobenius norms $\|\mathbf{U}_k\|_F$ for the quadratic forms in Llama 2 7B. Spike channels align with $\mathbf{U}_k$ matrices that have substantially larger norms than typical channels. These high-norm coordinates appear exclusively in step-up and step-down blocks.
Figure 4: Eigenvalue spectra of $\mathbf{S}_k$ for spike vs. non-spike channels in Llama 2 7B. Spike channels exhibit a single dominant eigenvalue $\lambda_\star$ that is orders of magnitude larger than the remainder of the spectrum; non-spike channels show no such outlier.
Figure 5: Cosine similarity among spike tokens before and after the step-up block in Llama 2 7B. Pre-step-up representations vary across spike tokens, but post-step-up representations collapse to nearly identical directions, empirically validating the near-constant approximation.
...and 3 more figures
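The two quantities plotted in Figure 2, the cosine similarity and norm ratio between a $\operatorname{SiLU}$ input and its output, can be computed as in the sketch below. The data are synthetic rather than Llama 2 7B activations, and the large positive entry is a hypothetical stand-in for a spike token. It shows one way a near-identity regime can arise: $\operatorname{SiLU}(x) = x\,\sigma(x) \approx x$ when a large positive value dominates the vector.

```python
import numpy as np

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def cosine_and_norm_ratio(x):
    """Cosine similarity and norm ratio between x and SiLU(x)."""
    y = silu(x)
    cos = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    ratio = float(np.linalg.norm(y) / np.linalg.norm(x))
    return cos, ratio

rng = np.random.default_rng(0)
typical = rng.normal(size=64)   # synthetic pre-activations of an ordinary token
spiky = typical.copy()
spiky[5] = 200.0                # hypothetical large positive pre-activation

cos_t, ratio_t = cosine_and_norm_ratio(typical)
cos_s, ratio_s = cosine_and_norm_ratio(spiky)
print("typical token:", cos_t, ratio_t)
print("spiky token:  ", cos_s, ratio_s)
```

For the spiky vector both values sit close to 1, whereas the typical vector is visibly shrunk and rotated by the nonlinearity.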

Theorems & Definitions (6)

Theorem 2.1: Attention output as a sum over heads
proof
Theorem 2.2: Quadratic-form approximation of a SiLU feed-forward coordinate
proof
Theorem 2.3: Coordinate bound under RMS normalization
proof
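Only the title of Theorem 2.3 appears in this listing, so its exact statement is not reproduced here. An elementary bound of this kind, which may differ from the paper's version, is that after RMS normalization without a learned gain every coordinate satisfies $|x_i| / \mathrm{RMS}(\mathbf{x}) \le \sqrt{d}$, because $x_i^2 \le \sum_j x_j^2 = d \cdot \mathrm{RMS}(\mathbf{x})^2$. The sketch below checks this numerically; `rms_normalize` is an illustrative helper, not a function from the paper.

```python
import numpy as np

def rms_normalize(x):
    """Divide x by its root-mean-square value (no learned gain, no epsilon)."""
    return x / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
d = 4096                       # illustrative hidden width
x = rng.normal(size=d)
x[7] = 1e6                     # hypothetical massive activation in one channel

y = rms_normalize(x)
bound = np.sqrt(d)

# However large the spike is, no normalized coordinate exceeds sqrt(d).
print("max |y_i|:", np.abs(y).max(), "bound:", bound)
assert np.abs(y).max() <= bound + 1e-6
```

A single extreme channel therefore saturates near $\sqrt{d}$ after normalization rather than growing without limit.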
