Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
TL;DR
The paper investigates three extreme-token phenomena in transformer-based large language models: attention sinks, value-state drains, and residual-state peaks. Using a simple Bigram-Backcopy task, the authors identify an active-dormant mechanism for attention heads and a mutual reinforcement dynamic in which sinks and drains sustain each other during training. They show that these insights extend to pretrained LLMs (e.g., Llama, OLMo), predicting and validating domain-dependent active heads and sink-logits concentration. The work further shows that simple interventions, such as replacing softmax with ReLU or replacing Adam with SGD, can mitigate these phenomena in toy models, suggesting potential pathways to improve inference and quantization in LLMs. Overall, the study provides a mechanistic account of extreme-token behavior and outlines practical mitigation strategies for pretraining.
Abstract
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
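As a minimal sketch of how the three phenomena described above might be measured for a single attention head, the NumPy snippet below computes the average attention weight placed on a candidate sink token, the ratio of its value-state norm to that of other tokens, and the same ratio for residual-state norms. The function names `causal_softmax` and `extreme_token_stats`, the choice of position 0 as the sink token, and the synthetic data are all assumptions made for illustration; they are not the authors' code or their exact metrics.

```python
import numpy as np


def causal_softmax(logits):
    """Row-wise softmax over a (T, T) logit matrix with a causal mask."""
    T = logits.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))      # query t sees keys 0..t
    masked = np.where(mask, logits, -np.inf)
    shifted = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)                        # masked entries become 0
    return weights / weights.sum(axis=-1, keepdims=True)


def extreme_token_stats(attn, values, resid, sink=0):
    """Hypothetical diagnostics for one head.

    attn:   (T, T) causal attention weights, rows are queries
    values: (T, d_v) value states
    resid:  (T, d_model) residual states
    sink:   index of the candidate sink token (assumed, default 0)
    """
    others = np.arange(attn.shape[0]) != sink
    value_norms = np.linalg.norm(values, axis=-1)
    resid_norms = np.linalg.norm(resid, axis=-1)
    return {
        # attention sink: mean weight that later queries place on the sink
        "sink_attention": float(attn[sink + 1:, sink].mean()),
        # value-state drain: sink value norm relative to the other tokens
        "value_norm_ratio": float(value_norms[sink] / value_norms[others].mean()),
        # residual-state peak: sink residual norm relative to the other tokens
        "resid_norm_ratio": float(resid_norms[sink] / resid_norms[others].mean()),
    }


# Synthetic example (assumed data): token 0 is constructed to act as a sink.
rng = np.random.default_rng(0)
T, d = 6, 16
logits = np.zeros((T, T))
logits[:, 0] = 5.0                                   # large logit on token 0
attn = causal_softmax(logits)
values = rng.normal(size=(T, d))
values[0] *= 0.01                                    # tiny value state
resid = rng.normal(size=(T, d))
resid[0] *= 100.0                                    # huge residual state
stats = extreme_token_stats(attn, values, resid)
print(stats)
```

On this synthetic input, a sink attention close to 1, a value-norm ratio well below 1, and a residual-norm ratio well above 1 correspond to the three signatures named in the abstract. Applying the same function to a real model would require extracting per-head attention weights, value states, and residual states, which this sketch does not cover.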
