The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu

TL;DR

The pre-norm configuration is the key architectural choice that enables massive activations and attention sinks to co-occur; ablating it causes the two phenomena to decouple.

Abstract

We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
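
As a rough illustration of the two definitions above, the sketch below scores each token by how far its largest activation exceeds the median magnitude, and scores each key position by the average attention it receives. The function names `spike_ratio` and `sink_mass`, the toy inputs, and the choice of median as the baseline are assumptions made for illustration; they are not the paper's measurement protocol.

```python
from statistics import median

def spike_ratio(hidden):
    """Per token: largest |activation| divided by the median |activation| of all entries.

    `hidden` is a list of token vectors (tokens x channels). Assumes the median is nonzero.
    """
    flat = [abs(v) for row in hidden for v in row]
    med = median(flat)
    return [max(abs(v) for v in row) / med for row in hidden]

def sink_mass(attn):
    """Per key position: mean attention received from the queries that can see it.

    `attn` is a causal, row-stochastic matrix (queries x keys), so key k is
    visible only to queries q >= k. Note that each key's own query is included.
    """
    T = len(attn)
    return [sum(attn[q][k] for q in range(k, T)) / (T - k) for k in range(T)]

# Toy data (hypothetical): token 0 carries one extreme channel and absorbs most attention.
hidden = [
    [50.0, 0.1, -0.2, 0.1],
    [0.3, -0.1, 0.2, 0.1],
    [0.1, 0.2, -0.3, 0.1],
]
attn = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.1, 0.1],
]

ratios = spike_ratio(hidden)
masses = sink_mass(attn)
print("spike ratios:", ratios)
print("sink masses:", masses)
```

In this toy input the same token is both the spike and the sink, which mirrors the co-occurrence the abstract describes; the two scores are computed independently, so they can also be used to check whether the phenomena decouple.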

Paper Structure (51 sections, 3 theorems, 40 equations, 8 figures, 10 tables)

Key Result

Theorem 2.1

Let $\mathbf{O}^{(h)}\coloneqq\mathbf{A}^{(h)} \mathbf{V}^{(h)} \in \mathbb{R}^{T \times \mathrm{d}_{\mathrm{head}}}$ denote the output of head $h$ with $\mathbf{V}^{(h)}\!\coloneqq\!\tilde{\mathbf{H}} \mathbf{W}_{\mathrm{V}}^{(h)}$ and $\tilde{\mathbf{H}}$ being the hidden representations of inputs with $\mathbf{W}_{\mathrm{O}}^{(h)} \in \mathbb{R}^{\mathrm{d}_{\mathrm{head}} \times \mathrm{d}_{\
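
The head-wise decomposition named in the title of Theorem 2.1 (attention output as a sum over heads) can be checked numerically. The sketch below is a minimal check under stated assumptions: it uses random matrices, omits the causal mask (the identity does not depend on it), and takes the full output projection to be the vertical stack of the per-head blocks $\mathbf{W}_{\mathrm{O}}^{(h)}$. The helper names and dimensions are hypothetical, and the code does not reproduce the full theorem statement.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def softmax_rows(scores):
    """Row-wise softmax, so each row of the result is an attention distribution."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def head_sum_gap(T=4, d_model=6, n_heads=2, seed=0):
    """Max absolute gap between concat(O^(h)) W_O and sum_h O^(h) W_O^(h)."""
    rng = random.Random(seed)
    d_head = d_model // n_heads
    H = rand_matrix(T, d_model, rng)  # stands in for the hidden representations
    head_outputs, W_O_blocks = [], []
    for _ in range(n_heads):
        A = softmax_rows(rand_matrix(T, T, rng))          # A^(h), unmasked
        V = matmul(H, rand_matrix(d_model, d_head, rng))  # V^(h) = H W_V^(h)
        head_outputs.append(matmul(A, V))                 # O^(h) = A^(h) V^(h)
        W_O_blocks.append(rand_matrix(d_head, d_model, rng))

    # Concatenation form: heads side by side, per-head W_O blocks stacked vertically.
    O_cat = [sum((O[t] for O in head_outputs), []) for t in range(T)]
    W_O = [row for block in W_O_blocks for row in block]
    concat_form = matmul(O_cat, W_O)

    # Sum form: each head is projected by its own block, then the results are added.
    sum_form = [[0.0] * d_model for _ in range(T)]
    for O, W in zip(head_outputs, W_O_blocks):
        contrib = matmul(O, W)
        for t in range(T):
            for j in range(d_model):
                sum_form[t][j] += contrib[t][j]

    return max(abs(a - b) for ra, rb in zip(concat_form, sum_form) for a, b in zip(ra, rb))

print("max gap:", head_sum_gap())
```

The gap should be at floating-point rounding level, since the two forms are the same block matrix product written two ways.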

Figures (8)

  • Figure 1: Top-3 channel magnitudes across depth in Llama 2 7B and Qwen3 8B (post-residuals vs. block outputs). In both models, early blocks inject massive activations that persist through most of the network before being neutralized by late blocks.
  • Figure 2: Input-output characteristics of $\operatorname{SiLU}$ in step-up and step-down blocks of Llama 2 7B. Based on $1024$ randomly sampled sentences from the C4 dataset (Raffel et al., 2020), we plot the cosine similarity and norm ratio for each token. Points are colored by the maximum magnitude of the block output. For spike tokens (red points), both direction and norm remain largely unchanged, indicating that the $\operatorname{SiLU}$ gate operates in a near-identity regime.
  • Figure 3: Frobenius norms $\|\mathbf{U}_k\|_F$ for the quadratic forms in Llama 2 7B. Spike channels align with $\mathbf{U}_k$ matrices that have substantially larger norms than typical channels. These high-norm coordinates appear exclusively in step-up and step-down blocks.
  • Figure 4: Eigenvalue spectra of $\mathbf{S}_k$ for spike vs. non-spike channels in Llama 2 7B. Spike channels exhibit a single dominant eigenvalue $\lambda_\star$ that is orders of magnitude larger than the remainder of the spectrum; non-spike channels show no such outlier.
  • Figure 5: Cosine similarity among spike tokens before and after the step-up block in Llama 2 7B. Pre-step-up representations vary across spike tokens, but post-step-up representations collapse to nearly identical directions, empirically validating the near-constant approximation.
  • ...and 3 more figures
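
The near-identity reading in the Figure 2 caption can be illustrated with the same two statistics it plots, cosine similarity and norm ratio between the input and output of $\operatorname{SiLU}$. The sketch below assumes that the dominant pre-activations are large and positive, where $\operatorname{SiLU}(x) = x\,\sigma(x) \approx x$; the example vectors are hypothetical and are not taken from Llama 2 7B.

```python
import math

def silu(x):
    """SiLU(x) = x * sigmoid(x). Fine for moderate x; exp(-x) overflows for very negative x."""
    return x / (1.0 + math.exp(-x))

def cosine_and_norm_ratio(vec):
    """Cosine similarity and norm ratio between a vector and its elementwise SiLU."""
    out = [silu(v) for v in vec]
    dot = sum(a * b for a, b in zip(vec, out))
    norm_in = math.sqrt(sum(a * a for a in vec))
    norm_out = math.sqrt(sum(b * b for b in out))
    return dot / (norm_in * norm_out), norm_out / norm_in

# Hypothetical pre-activations: one dominated by large positive entries, one ordinary.
large = [12.0, 9.0, 15.0, 0.2]
small = [0.5, -0.8, 0.3, -0.2]

cos_large, ratio_large = cosine_and_norm_ratio(large)
cos_small, ratio_small = cosine_and_norm_ratio(small)
print("large:", cos_large, ratio_large)
print("small:", cos_small, ratio_small)
```

For the large vector both statistics sit close to 1, while the ordinary vector loses more than half its norm. Large negative inputs would behave differently, since $\operatorname{SiLU}$ maps them toward zero.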

Theorems & Definitions (6)

  • Theorem 2.1: Attention output as a sum over heads
  • proof
  • Theorem 2.2: Quadratic-form approximation of a SiLU feed-forward coordinate
  • proof
  • Theorem 2.3: Coordinate bound under RMS normalization
  • proof
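
The statement of Theorem 2.3 is not reproduced in this summary, so the following is only a sketch of a standard bound of that type and may differ from the paper's exact result. Assuming RMS normalization without a learned gain, $y_i = x_i / \sqrt{\tfrac{1}{d}\sum_j x_j^2 + \epsilon}$, every coordinate satisfies $|y_i| \le \sqrt{d}$, so an arbitrarily large spike is capped after normalization. The function name `rms_norm` and the test vectors are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    """Gain-free RMS normalization: divide by the root mean square of the entries."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def max_abs(x):
    return max(abs(v) for v in x)

d = 16
bound = math.sqrt(d)

# Hypothetical inputs: one spike channel of growing magnitude among ordinary channels.
for spike in (1e1, 1e3, 1e6):
    x = [1.0] * (d - 1) + [spike]
    y = rms_norm(x)
    print(f"spike={spike:g}  max|y|={max_abs(y):.6f}  bound={bound:.6f}")
```

As the spike grows, the largest normalized coordinate approaches $\sqrt{d}$ but never exceeds it, and the remaining coordinates shrink toward zero. This is consistent with the abstract's description of spike tokens having near-constant representations, although the link drawn here is an interpretation rather than a restatement of the proof.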