Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei

TL;DR

The paper investigates extreme-token phenomena in transformers (attention sinks, value-state drains, and residual-state peaks) that appear in large language models. Using a simple Bigram-Backcopy task, the authors identify an active-dormant mechanism for attention heads and a mutual reinforcement dynamic in which sinks and drains sustain each other during training. They demonstrate that these insights extend to pretrained LLMs (e.g., Llama, OLMo), predicting and validating domain-dependent active heads and sink-logits concentration. The work further shows that simple interventions, such as replacing softmax with ReLU or switching from Adam to SGD, can mitigate these phenomena in toy models, suggesting potential pathways to improve inference and quantization in LLMs. Overall, the study provides a mechanistic account of extreme-token behavior and outlines practical mitigation strategies for pretraining.

Abstract

Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
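As a rough illustration of the terms in the abstract, the following sketch computes the output of a single attention head for one query position. It is not the paper's model or code: the four-token context, the logits, and the value vectors are all made-up numbers, and the helper names (`softmax`, `relu_weights`, `head_output`, `norm`) are hypothetical. It shows one common intuition, under those assumptions: softmax weights must sum to one, so a head that should contribute nothing can park its attention on a sink token whose value state is near zero, while ReLU weights can simply all be zero.

```python
import math

def softmax(logits):
    # Numerically stable softmax: weights are positive and sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(logits):
    # ReLU "attention": weights are non-negative but need not sum to 1.
    return [max(0.0, x) for x in logits]

def head_output(weights, values):
    # Weighted sum of value vectors, one coordinate at a time.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

# Assumed toy context: position 0 plays the role of the sink token <s>.
values = [
    [0.01, -0.01],  # sink token: near-zero value state (value-state drain)
    [1.0, 0.5],
    [-0.8, 1.2],
    [0.6, -0.9],
]

# Assumed logits for two regimes of the same softmax head.
dormant_logits = [8.0, 0.0, 0.0, 0.0]  # almost all attention goes to the sink
active_logits = [0.0, 0.0, 6.0, 0.0]   # attention goes to a content token

dormant_weights = softmax(dormant_logits)
active_weights = softmax(active_logits)
dormant_out = head_output(dormant_weights, values)
active_out = head_output(active_weights, values)

# With ReLU weights, negative logits everywhere give an exactly zero output,
# so no sink token is needed for the head to stay silent.
relu_out = head_output(relu_weights([-1.0, -2.0, -0.5, -3.0]), values)

print("dormant: sink weight", dormant_weights[0], "output norm", norm(dormant_out))
print("active:  output norm", norm(active_out))
print("relu:    output norm", norm(relu_out))
```

In this toy setting the dormant head's output norm is small only because the sink's value vector is small; giving position 0 a value vector of ordinary size would make the same attention pattern inject a large update.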

Paper Structure

This paper contains 45 sections, 12 theorems, 94 equations, 42 figures, 1 table.

Key Result

Theorem 1

For any parameters $(\bm{\alpha} \in \mathbb{R}^{V}, \boldsymbol{\beta} \in \mathbb{R}^V, \bm{\xi} \in \mathbb{R}^V, \lambda \in \mathbb{R})$, there exists a one-layer transformer as described in (eqn:simplified_transformer) with weight matrices $({\mathbf Q}, {\mathbf K}, {\mathbf V}, {\mathbf W}_1

Figures (42)

  • Figure 1: Extreme-token phenomena in Llama 3.1. We evaluate the attention weights, value-state norms, and residual-state norms on the Llama 3.1-8B-Base model, where the input sentence is "$\langle \texttt{s}\rangle$ Summer is warm$\langle \texttt{period}\rangle$ Winter is cold$\langle \texttt{period}\rangle$". Left (a): The attention weights across multiple heads at Layer 24. We observe the attention sink phenomenon: the $\langle \texttt{s}\rangle$ token attracts a significant portion of the overall attention weight. Middle (b): The empirical distribution of the norms of value states over all layers and all heads. We exclude 2% of the outlier values to help visualization. We observe the value-state drain phenomenon: the value state of the $\langle \texttt{s}\rangle$ token is much smaller than those of other tokens on average. Right (c): The norm of the residual stream states, measured at the output of each layer. We observe the residual-state peak phenomenon: the $\langle \texttt{s}\rangle$ token's residual states have significantly larger norms than those of other tokens from layers 1 to 30. We present the extreme-token phenomena over other input sequences in Appendix \ref{sec:many_samples}.
  • Figure 2: The Bigram-Backcopy task
  • Figure 3: Attention pattern
  • Figure 4: Small value states
  • Figure 6: Active-dormant mechanism
  • ...and 37 more figures

Theorems & Definitions (25)

  • Claim 1: Active-dormant mechanism
  • Theorem 1: Existence of reparameterization that solves the BB task; informal
  • Theorem 2
  • Claim 2: Mutual reinforcement mechanism
  • Lemma A.1
  • Proof of \ref{appthm:positive-definite}
  • Lemma A.2
  • Proof of Lemma \ref{appthm:min-eigenvalue}
  • Lemma A.3
  • Proof of Lemma \ref{appthm:min-eigenvalue-q}
  • ...and 15 more