Temporal Dependencies in In-Context Learning: The Role of Induction Heads

Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Billy Dickson, Zoran Tiganj

Abstract

Large language models (LLMs) exhibit strong in-context learning capabilities, but how they track and retrieve information from context remains underexplored. Drawing on the free recall paradigm in cognitive science (where participants recall list items in any order), we show that several open-source LLMs consistently display a serial-recall-like pattern, assigning peak probability to the token that immediately follows a repeated token in the input sequence. Through systematic ablation experiments, we show that induction heads (specialized attention heads that attend to the token following a previous occurrence of the current token) play an important role in this phenomenon. Removing heads with high induction scores substantially reduces the +1 lag bias, whereas ablating random heads does not reproduce the same reduction. We also show that removing heads with high induction scores impairs the performance of models prompted, via few-shot learning, to perform serial recall more than removing random heads does. Our findings highlight a mechanistically specific connection between induction heads and temporal context processing in transformers, suggesting that these heads are especially important for ordered retrieval and serial-recall-like behavior during in-context learning.
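The induction score mentioned in the abstract measures how strongly an attention head attends from each token to the token that immediately followed that token's previous occurrence. Below is a minimal sketch of such a score, assuming attention weights for a sequence built from two copies of the same random token list are already available as an array of shape (n_heads, seq_len, seq_len); the function name, tensor layout, and toy data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def induction_score(attn: np.ndarray, period: int) -> np.ndarray:
    """Average attention each head pays to the induction target.

    attn:   attention weights, shape (n_heads, seq_len, seq_len);
            rows index query positions, columns index key positions.
    period: length of the repeated token block, so the token at query
            position i (in the second copy) previously occurred at
            i - period, and the induction target is i - period + 1.
    """
    n_heads, seq_len, _ = attn.shape
    scores = np.zeros(n_heads)
    # Only query positions in the second copy have a valid induction target.
    queries = list(range(period, seq_len))
    for i in queries:
        scores += attn[:, i, i - period + 1]
    return scores / len(queries)

# Toy usage: 4 heads, sequence of 16 tokens made of two copies of an 8-token list.
rng = np.random.default_rng(0)
raw = np.tril(rng.random((4, 16, 16)))        # random causal attention pattern
attn = raw / raw.sum(-1, keepdims=True)        # normalize each query row
print(induction_score(attn, period=8))         # one score per head
```

Heads whose score is high under this kind of probe are the "induction heads" that the ablation experiments target.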

Paper Structure

This paper contains 15 sections, 1 equation, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Induction scores for the four models across layers and heads. Top row: Base models. Middle row: Instruction-tuned models. Bottom row: Difference in induction scores between instruction-tuned and base models for each layer and head.
  • Figure 2: Impact of induction and random head ablation (100 heads in each case) on the model output probability as a function of lag. The models were presented with a sequence of 501 tokens in which the last token repeats the token at index 250, and lag is defined relative to that repeated token (see Methods for more details and Fig. \ref{fig:CRP_6} for a zoomed version showing lags -6 to 6; a minimal code sketch of this repeated-token probe appears after this figure list). The results show averages across 5000 runs with shuffled token sequences. Top row: Base models. Bottom row: Instruction-tuned models.
  • Figure 3: Impact of induction and random head ablation (100 heads in each case) on the model output probability as a function of lag: same as Fig. \ref{fig:CRP_250}, but zoomed in on lags -6 to 6 to emphasize that the highest probabilities occur at lag +1 or 0. Top row: Base models. Bottom row: Instruction-tuned models.
  • Figure 4: Impact of induction head ablation on the model output probability for the token at lag +1. The models were presented with a sequence of 501 tokens in which the last token repeats the token at index 250, and lag is defined relative to that repeated token; hence the probability at lag +1 is the probability that the model assigns to token 251 (see Methods for more details). The results show averages across 5000 runs with shuffled token sequences. We ablated the following numbers of induction heads (ranked by induction score) and random heads (x-axis): 1, 5, 10, 20, 30, 40, 50, 80, 100, 150, 200, 250, and 300. Top row: Base models. Bottom row: Instruction-tuned models. Exact numerical values for these experiments are shown in Table \ref{tab:tab_indhead_base} and Table \ref{tab:tab_indhead_inst} for induction head ablation in base and instruction-tuned models, respectively, and in Table \ref{tab:tab_randhead_base} and Table \ref{tab:tab_randhead_inst} for random head ablation in base and instruction-tuned models, respectively.
  • Figure 5: Model performance in ICL is more sensitive to ablation of induction heads than of random heads. Conditional recall probability at lag +1 for Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct as a function of the number of ablated induction and random attention heads. Exact values are shown in Table \ref{tab:ICL_serial_recall}.
  • ...and 12 more figures
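Several of the captions above refer to the same probe: a 501-token sequence whose final token repeats the token at index 250, with the model's next-token probability read out as a function of lag from that repeated position. The sketch below shows one such run under stated assumptions: a HuggingFace causal LM stands in for the paper's models (the model name here is a placeholder), the 500-token prefix is a shuffled sample of unique token ids, and head ablation (not shown) would additionally require hooking the attention modules.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper evaluates larger open-source LLMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def lag_probabilities(token_ids, repeat_index=250, max_lag=6):
    """Append a repeat of token_ids[repeat_index] and read out the model's
    next-token probability at each lag around that repeated position."""
    ids = token_ids + [token_ids[repeat_index]]          # 501 tokens total
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]          # next-token logits
    probs = torch.softmax(logits, dim=-1)
    return {
        lag: probs[token_ids[repeat_index + lag]].item() # lag +1 -> token after the repeat
        for lag in range(-max_lag, max_lag + 1)
    }

# One run with 500 shuffled unique token ids; the paper averages 5000 such runs.
prefix = torch.randperm(tok.vocab_size)[:500].tolist()
print(lag_probabilities(prefix))
```

A serial-recall-like bias shows up as the probability at lag +1 dominating the other lags, and ablating high-induction-score heads is what the paper reports as reducing that peak.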