When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin

TL;DR

This work provides a comprehensive empirical analysis of attention sink in autoregressive language models, demonstrating that the first token consistently attracts disproportionate attention and that this sink emerges during pre-training across model sizes and input types. By introducing metrics and probing the influence of optimization, data distribution, loss functions, and architectural choices, the authors show that sink behaves like key biases stored in attention, and can be mitigated by replacing softmax with sigmoid-based attention or by altering bias structures. The study reveals that sink location can shift with data and architectural changes, and that biases or alternative attention mechanisms can eliminate sink while preserving performance up to 1B parameters. The findings have practical implications for streaming/long-context inference, KV caching, and model quantization, and point to future work on sink tokens beyond the first position and their impact on downstream tasks.

Abstract

Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
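
The dependence that softmax normalization introduces among attention scores can be illustrated with a small sketch. It is not taken from the paper's code. The logits are made-up numbers, and `sigmoid_scores` is a hypothetical stand-in for "sigmoid attention without normalization" applied to a single query row. Raising one logit lowers every other softmax score, because the row must sum to one, while the sigmoid scores of the untouched positions stay the same.

```python
import math

def softmax(logits):
    """Normalized attention scores for one query row (sums to 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_scores(logits):
    """Element-wise sigmoid scores, no normalization across positions.

    Hypothetical helper for illustration; the paper's exact variant may differ.
    """
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Illustrative logits for one query attending to four keys (assumed values).
logits = [2.0, 0.5, -1.0, 0.0]
bumped = [2.0, 0.5, -1.0, 3.0]  # only the last logit changes

soft_before, soft_after = softmax(logits), softmax(bumped)
sig_before, sig_after = sigmoid_scores(logits), sigmoid_scores(bumped)

# Softmax: scores at positions 0..2 shrink although their logits are unchanged.
softmax_changed = [b != a for b, a in zip(soft_before[:3], soft_after[:3])]
# Sigmoid: scores at positions 0..2 are untouched.
sigmoid_changed = [b != a for b, a in zip(sig_before[:3], sig_after[:3])]

print("softmax scores changed:", softmax_changed)
print("sigmoid scores changed:", sigmoid_changed)
```

Under softmax, any attention mass a query does not need must still go to some position. This matches the abstract's description of the sink as storing extra attention scores.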

Paper Structure

This paper contains 33 sections, 4 theorems, 31 equations, 30 figures, 15 tables.

Key Result

Proposition 1

For LMs with NoPE, the attention scores for $t$ repeated tokens are $t^{-1}$ uniformly, i.e., there is no attention sink.
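
A quick numerical check of this statement for a single attention head is sketched below. It rests on several assumptions that are not part of the proposition: one layer, random projection matrices `W_q` and `W_k`, a hidden size of 4, and scaled dot-product logits. Without positional embeddings, repeated tokens share the same hidden state, so every key yields the same logit and causal softmax spreads attention evenly over the visible positions.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_softmax_row(q, visible_keys):
    """Softmax attention scores of query q over the keys it may attend to."""
    d = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible_keys]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d = 4  # hidden size (assumed for illustration)
W_q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
W_k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# One token embedding repeated t times; NoPE means nothing position-dependent is added.
x = [random.gauss(0, 1) for _ in range(d)]
t = 6
hidden = [x] * t

queries = [matvec(W_q, h) for h in hidden]
keys = [matvec(W_k, h) for h in hidden]

# Row i (0-based) attends to positions 0..i under the causal mask.
attention_rows = [causal_softmax_row(queries[i], keys[: i + 1]) for i in range(t)]

for i, row in enumerate(attention_rows):
    print(i + 1, [round(a, 4) for a in row])
```

Each row with $i$ visible positions contains $i$ equal scores of $1/i$, so the first token receives no more attention than any other.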

Figures (30)

  • Figure 1: (Left) Architecture of the pre-norm transformer block (the location of the post-norm LN is highlighted with dashed lines). We denote the output of MHSA as $\boldsymbol{O}^l$ and the output of FFN as $\boldsymbol{F}^l$. (Right) The packing strategy in LM pre-training. All documents are concatenated with BOS (optional) and EOS tokens as the boundaries. The resulting token stream is then chunked into equal-sized sequences with context length $C$.
  • Figure 2: In LLaMA3-8B Base, (Top) the first token has significantly larger $\ell_2$-norm of hidden states, but much smaller $\ell_2$-norm of keys and values than the mean of other tokens; (Bottom) cosine similarity instead of norm product contributes to attention sink. We delay more visualizations to Appendix \ref{vis_norm}.
  • Figure 3: The metric $\textrm{Sink}_1^{\epsilon}$ (averaged over 100 sequences) tends to decrease with larger token lengths $T$. This tendency becomes more pronounced under a stricter definition of attention sink (larger $\epsilon$).
  • Figure 4: (Left) Attention sink also emerges in small LMs. (Middle) Dynamics of train/valid loss and $\textrm{Sink}_1^\epsilon$ during LM pre-training under the default setup. Attention sink emerges after a certain number of optimization steps. (Right) Training loss (solid lines) and attention sink (dashed lines) dynamics of LMs using different learning rates. With smaller learning rates, attention sink tends to emerge after more optimization steps and to be less pronounced.
  • Figure 5: (Left) Attention pattern for prefix language modeling. (Middle) Attention sink appears not only on the first token but also among the prefix tokens for LMs with $p=5$. (Right) With less training data, attention sink disappears. Meanwhile, the trained LMs show overfitting behavior.
  • ...and 25 more figures
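
The packing strategy described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' data pipeline. The function name `pack_documents`, the token ids, and the choice to drop the trailing partial chunk are all assumptions made for the example.

```python
def pack_documents(documents, context_length, eos_id, bos_id=None):
    """Concatenate tokenized documents and chunk them into fixed-size sequences.

    documents: list of token-id lists.
    bos_id is optional, mirroring the caption's "BOS (optional)".
    Assumption: a trailing remainder shorter than context_length is dropped.
    """
    stream = []
    for doc in documents:
        if bos_id is not None:
            stream.append(bos_id)  # optional document-start boundary
        stream.extend(doc)
        stream.append(eos_id)      # document-end boundary
    n_full = len(stream) // context_length
    return [
        stream[i * context_length : (i + 1) * context_length]
        for i in range(n_full)
    ]

# Toy token ids (hypothetical): 1 = BOS, 2 = EOS, others = ordinary tokens.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = pack_documents(docs, context_length=4, eos_id=2, bos_id=1)
for chunk in chunks:
    print(chunk)
```

In this toy run only the first chunk starts with BOS, and the others begin mid-document. The first position of a training sequence is therefore not tied to a particular token under packing.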

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof