Interpreting the Repeated Token Phenomenon in Large Language Models

Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, Yossi Gandelsman

TL;DR

The paper investigates why large language models sometimes fail to faithfully repeat a single input token, identifying the attention-sink mechanism as the root cause. Through mechanistic interpretability, it uncovers a two-stage neural circuit in which the first attention layer marks the initial token and a later MLP neuron amplifies its hidden state to create a high-norm sink that attracts subsequent attention; this sink is erroneously triggered by sequences of repeated tokens, causing divergence. The authors demonstrate this via cross-model evidence and formalize how the first attention layer cannot distinguish a single token from long repeats, leading to the observed behavior; they also introduce a cluster-attack that induces sinks without repetition. A targeted patch to sink-mediating neurons mitigates the divergence with minimal impact on unrelated tasks, highlighting how mechanistic insights can guide secure and reliable improvements to LLMs. Overall, the work links fluency-driven attention dynamics to a concrete vulnerability and proposes a principled defense strategy grounded in neural-circuit understanding, with implications for interpretability-driven model hardening.

Abstract

Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to, and instead output unrelated text. This unexplained failure mode represents a vulnerability, allowing even end-users to diverge models away from their intended behavior. We aim to explain the causes for this phenomenon and link it to the concept of "attention sinks", an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other non-repeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the model's overall performance. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.

Paper Structure

This paper contains 30 sections, 3 theorems, 12 equations, 11 figures, 2 tables.

Key Result

Theorem 4.1

Let $x$ be a token and $T$ a Transformer. Consider a sequence $S_n$ with $k$ fixed prefix tokens and $n$ repetitions of $x$ and a singleton sequence $S^*$ which consists of a single $x$. As $n \to \infty$ the representation of the last element of $S_n$ converges (strongly) to the representation of $

Figures (11)

  • Figure 1: Attention scores for layers 2, 3, 17, and 31 of LLaMA2-7B-HF. As can be seen in the figure, the attention scores for the repeated "the" tokens in the top panel are significantly higher than those for other tokens in the sequence. This high attention is comparable to the attention received by the first token in the regular sentences shown in the bottom panel. This similarity suggests a connection between the attention sink mechanism and the high attention given to repeated tokens.
  • Figure 2: Repeated tokens exhibit extreme norms, similar to the beginning-of-sequence (BoS) token, in early layers. We present the norm of the hidden state at the sink layer (layer 1) for three repeating words. As the number of repetitions increases, the norm increases and becomes similar to the norm of the BoS token (0 repetitions). This observation explains the high attention scores shown in Figure 1. We later provide causal evidence for this relationship through ablation (Figure 3). For comparison, we also show the norm of the BoS token and the average norm of tokens from the Tiny Shakespeare dataset.
  • Figure 3: Ablation of sink neurons. Norm is reduced both for BoS and repeat sequences. Top: Token activation norms without the patch. Bottom: Token activation norms with the patch. Ablating specific neurons significantly reduces the high norms associated with repeated tokens. Data is from LLaMa-2. The sink neurons (listed in the paper's findings table) were zero-ablated. The input consisted of 1200 repeats of the tokens ['Another', 'one', 'bit', 'es', 'the', 'dust']. See the paper's generalization appendix for similar results on other models.
  • Figure 4: The first token and subsequent tokens belong to distinct, linearly separable subspaces. This figure shows the projection of token representations after the first attention layer. The different colors represent the first tokens and subsequent tokens. The clear separation indicates linear separability. Furthermore, we identified a single neuron ($\text{MLP}_0$, gate neuron 912 in LLaMa2) that perfectly separates these subspaces.
  • Figure 5: Empirical evidence showing that the first attention layer does not distinguish between token repetitions and the first token. We first computed the output of the first attention layer for an input of a single token, without BoS. We then compared it, using the L2 norm of the difference, to the output of the first attention layer for "$<$BoS$>$ some prefix:" followed by the same token ('cat', 'dog', 'the' or 'bla') repeated 500 times on LLaMa-2. This supports Theorem 4.1, showing that the convergence takes place in practice with fewer than $max\_context\_window$ repeats.
  • ...and 6 more figures
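
The comparison described for Figure 5 can be illustrated with a toy sketch. This is not the paper's code and does not use LLaMa-2: it assumes a single softmax attention head with random weights and no positional encoding, and every name in it (`last_token_attention`, `emb`, `Wq`, `Wk`, `Wv`, the token ids) is hypothetical. Under those assumptions, the output at the last position of "prefix + $n$ repeats of $x$" approaches the output for the single token $x$ as $n$ grows, because the repeated token's share of the softmax mass tends to 1.

```python
import numpy as np

def last_token_attention(tokens, emb, Wq, Wk, Wv):
    """Output of one softmax attention head at the last position.

    Toy assumption: no positional encoding, so identical tokens
    produce identical keys and values.
    """
    X = emb[tokens]                       # (seq_len, d) token embeddings
    q = X[-1] @ Wq                        # query of the last token
    K = X @ Wk                            # keys of all positions
    V = X @ Wv                            # values of all positions
    scores = K @ q / np.sqrt(q.shape[0])  # scaled dot-product scores
    w = np.exp(scores - scores.max())     # numerically stable softmax
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
vocab, d = 10, 16
emb = rng.normal(size=(vocab, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

prefix = [1, 2, 3]   # hypothetical fixed prefix token ids
x = 7                # hypothetical id of the repeated token

# Reference: the sequence made of a single x (no prefix).
single = last_token_attention([x], emb, Wq, Wk, Wv)

# L2 distance between the repeated-sequence output and the singleton output.
gaps = {}
for n in (1, 10, 100, 1000):
    rep = last_token_attention(prefix + [x] * n, emb, Wq, Wk, Wv)
    gaps[n] = float(np.linalg.norm(rep - single))

for n, gap in gaps.items():
    print(f"n={n:5d}  L2 gap={gap:.6f}")
```

In this toy setting the gap shrinks roughly like $1/n$, since the prefix contributes a fixed amount of softmax mass while the repeated token contributes mass proportional to $n$. Real models add positional information, so this sketch only conveys the intuition behind the figure, not the paper's measured values.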

Theorems & Definitions (5)

  • Theorem 4.1: Informal.
  • Lemma D.1
  • proof
  • Theorem D.2
  • proof