How Attention Sinks Emerge in Large Language Models: An Interpretability Perspective

Runyu Peng, Ruixiao Li, Mingshu Chen, Yunhua Zhou, Qipeng Guo, Xipeng Qiu

TL;DR

By analyzing training traces from a 30B A3B MoE model trained from scratch, the authors find that the P0 Sink Circuit, the mechanism that produces the attention sink at position zero, emerges early in training and becomes increasingly concentrated in the first two layers, suggesting a possible signal for tracking pre-training convergence states.

Abstract

Large Language Models (LLMs) often allocate disproportionate attention to specific tokens, a phenomenon commonly referred to as the attention sink. While such sinks are generally considered detrimental, prior studies have identified a notable exception: the model's consistent emphasis on the first token of the input sequence. This structural bias can influence a wide range of downstream applications and warrants careful consideration. Despite its prevalence, the precise mechanisms underlying the emergence and persistence of attention sinks remain poorly understood. In this work, we trace the formation of attention sinks around the first token of the input. We identify a simple mechanism, referred to as the P0 Sink Circuit, that enables the model to recognize the token at position zero and induce an attention sink within two transformer blocks, without relying on any semantic information. This mechanism serves as the basis for the attention sink on position zero. Furthermore, by analyzing training traces from a 30B A3B MoE model trained from scratch, we find that this mechanism emerges early in training and becomes increasingly concentrated in the first two layers, suggesting a possible signal for tracking pre-training convergence states.
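
To make "disproportionate attention to the first token" concrete, the following is a minimal NumPy sketch of one possible way to quantify a position-zero sink: the mean attention weight that later queries place on key position 0. It is an illustration under stated assumptions, not the paper's metric. The names `causal_softmax` and `p0_sink_score`, the random logits, and the logit offset of 6.0 are all hypothetical.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax over a causal (lower-triangular) mask."""
    seq_len = scores.shape[-1]
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is never masked.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def p0_sink_score(attn):
    """Mean attention weight that queries at positions >= 1 place on key position 0."""
    return float(attn[..., 1:, 0].mean())

# Synthetic attention logits stand in for a real model's scores (assumption).
rng = np.random.default_rng(0)
seq_len = 16
scores = rng.normal(size=(seq_len, seq_len))
baseline = p0_sink_score(causal_softmax(scores))

# Hypothetical sink: raise every logit that targets key position 0.
sink_scores = scores.copy()
sink_scores[:, 0] += 6.0
with_sink = p0_sink_score(causal_softmax(sink_scores))

print(f"baseline P0 score: {baseline:.3f}, with sink: {with_sink:.3f}")
```

Query position 0 is excluded from the score because, under a causal mask, it can only attend to itself and would always contribute a weight of 1.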

Paper Structure (23 sections, 11 equations, 30 figures, 3 tables)

Figures (30)

  • Figure 1: Overview of the proposed P0-Sink Circuit. Within just two transformer blocks, the model learns to identify the position-zero (P0) token and amplify it into a fixed high-norm representation, which gives rise to the attention sink effect.
  • Figure 2: Layer-wise $\ell_2$ norm of hidden states and attention score heat maps in Qwen3-4B. Although this model has no [BOS] token, its position-zero (P0) sink appears after layer 2 and becomes pronounced after layer 7, alongside an increase in the $\ell_2$ norm at position zero. Half-integer layer indices denote attention outputs after the residual connection.
  • Figure 3: Layer-wise $\ell_2$ norm of hidden states and attention score heat maps in LLaMA3.1-8B. Half-integer indices correspond to the output of attention modules after residual connection. Removing the [BOS] token eliminates the layer-1 attention sink and its associated $\ell_2$ norm amplification. However, due to renewed norm growth before layer 2, a position-zero (P0) sink re-emerges in the attention maps even in the absence of the [BOS] token.
  • Figure 4: Average cosine similarity among hidden states at different positions across inputs, along with the corresponding mean hidden state vectors at position zero in LLaMA3.1-8B. Half-integer indices indicate the outputs of attention modules after residual addition. The [BOS] token has been removed, so position zero corresponds to the first non-[BOS] token.
  • Figure 5: Visualization of intermediate MLP activations and attention scores in layer 0 of LLaMA3.1-8B. The [BOS] token has been removed. Several heads exhibit strong locality by primarily attending to neighboring tokens. Ablating individual heads does not weaken the clustering behavior at position zero; in fact, removing certain heads can even enhance it. This suggests that the effect arises from a complex, collaborative mechanism involving multiple attention heads.
  • ...and 25 more figures
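
The quantities named in the captions of Figures 2 to 4 (layer-wise $\ell_2$ norms of hidden states, and the average cosine similarity of hidden states at a given position across inputs) can be sketched as follows. This is a rough illustration on synthetic activations, not the authors' analysis code. The array shapes, the helper names `positionwise_l2_norm` and `mean_pairwise_cosine`, and the injected high-norm direction at position zero are assumptions made only for the example.

```python
import numpy as np

def positionwise_l2_norm(hidden):
    """hidden: (num_layers, seq_len, d_model) -> (num_layers, seq_len) l2 norms."""
    return np.linalg.norm(hidden, axis=-1)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of rows in (n, d_model)."""
    unit = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    sim = unit @ unit.T
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(off_diag.mean())

# Synthetic stand-in for hidden states collected from a real model (assumption).
rng = np.random.default_rng(0)
num_inputs, num_layers, seq_len, d_model = 8, 4, 12, 32
hidden = rng.normal(size=(num_inputs, num_layers, seq_len, d_model))

# Hypothetical effect: from layer 2 onward, position 0 of every input is pushed
# along one shared direction with a large norm.
shared = rng.normal(size=d_model)
shared /= np.linalg.norm(shared)
hidden[:, 2:, 0, :] += 50.0 * shared

norms = positionwise_l2_norm(hidden[0])              # one input, all layers
cos_p0 = mean_pairwise_cosine(hidden[:, -1, 0, :])   # last layer, position 0
cos_p5 = mean_pairwise_cosine(hidden[:, -1, 5, :])   # last layer, position 5

print("last-layer norm at position 0:", round(float(norms[-1, 0]), 2))
print("cross-input cosine at P0:", round(cos_p0, 3), "at position 5:", round(cos_p5, 3))
```

With real activations, `hidden` would instead be filled from a model's per-layer hidden states; the synthetic data here only shows that a shared high-norm vector at position zero raises both the norm and the cross-input similarity at that position.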