On the Emergence of Induction Heads for In-Context Learning

Tiberiu Musat, Tiago Pimentel, Lorenzo Noci, Alessandro Stolfo, Mrinmaya Sachan, Thomas Hofmann

TL;DR

This work gives a principled account of how in-context learning (ICL) arises in transformers by studying the emergence of induction heads during gradient-descent training on a minimal ICL task. Using a two-layer, attention-only transformer with a disentangled residual design, the authors prove that the training dynamics stay in a $19$-dimensional subspace of the parameter space, and they observe empirically that only $3$ pseudo-parameters account for the emergence of the induction head. Within this $3$-dimensional subspace, the time until the induction head emerges follows a tight asymptotic bound that is quadratic in the context length $L$. The authors also explain the structure of the weight matrices theoretically and validate it empirically, yielding an interpretable mechanism for how ICL forms. The findings show how simple, interpretable weight structures can implement in-context learning, which may help in understanding how this capability develops during the training of larger language models.

Abstract

Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.
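The task and mechanism described in the abstract can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `make_icl_sequence` and `idealized_induction_head` are hypothetical, items and labels are assumed to be random Gaussian vectors interleaved as item, label, item, label, ..., query, and attention is replaced by hard (argmax) lookups.

```python
import numpy as np


def make_icl_sequence(num_pairs, dim, rng):
    """Build one toy ICL example (hypothetical layout): interleaved
    item/label vectors followed by a query item that repeats one item."""
    items = rng.standard_normal((num_pairs, dim))
    labels = rng.standard_normal((num_pairs, dim))
    target = int(rng.integers(num_pairs))
    tokens = np.empty((2 * num_pairs + 1, dim))
    tokens[0:-1:2] = items      # even positions hold items
    tokens[1:-1:2] = labels     # odd positions hold labels
    tokens[-1] = items[target]  # last position is the query item
    return tokens, labels[target]


def idealized_induction_head(tokens):
    """Two-step lookup with hard attention (an idealization).

    Step 1: every position copies the token that precedes it, so each
            label position also carries its item.
    Step 2: the query is matched against those copied items and the
            token stored at the best-matching position is returned.
    """
    prev = np.zeros_like(tokens)
    prev[1:] = tokens[:-1]            # previous-token copy
    query = tokens[-1]
    scores = prev[:-1] @ query        # compare query with copied items
    best = int(np.argmax(scores))     # hard attention over the context
    return tokens[best]


rng = np.random.default_rng(0)
tokens, expected_label = make_icl_sequence(num_pairs=8, dim=128, rng=rng)
predicted_label = idealized_induction_head(tokens)
```

In this sketch the first step plays the role of the first attention layer and the second step the role of the second layer; a trained model would use soft attention weights rather than an exact argmax.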

Paper Structure

This paper contains 13 sections, 1 theorem, 10 equations, 3 figures.

Key Result

Theorem 1

Assume that we train a disentangled transformer from zero initialization with population loss on isotropic data on our ICL task. Then, the weight matrices will have the following structure throughout the entire training process: where we collect the parameters of each weight matrix in three vectors ${\bm{\alpha}} \in \mathbb{R}^3$, ${\bm{\beta}} \in \mathbb{R}^{12}$ and ${\bm{\gamma}} \in \mathbb

Figures (3)

  • Figure 1: Left: an induction head solving the in-context learning (ICL) task. Given a series of item-label pairs, the model predicts the correct label for a query item. The first attention head retrieves the corresponding item for each label, enabling the second attention head to retrieve the correct label. Each path is modulated by one pseudo-parameter ($\alpha_3$, $\beta_2$, or $\gamma_3$). Right: our minimal transformer architecture. We use two attention-only layers and a linear layer. We disentangle the attention layers by concatenating the inputs and outputs, rather than adding them together.
  • Figure 2: The weights of a two-layer attention-only transformer can be understood using a highly interpretable transformation. Dots $\cdot$ denote matrix multiplication. For example, the bottom-right block of the left plot, ${\bm{P}}^\intercal \, {{\bm{W}}_K^1}^\intercal {\bm{W}}_Q^1 {\bm{P}}$, is dominated by the subdiagonal, showing that each position attends to the previous. Some noise is due to random initialization and stochastic gradient descent.
  • Figure 3: Weights at the end of standard training have the theoretically predicted structure.
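Two ideas from the captions above, a position-to-position score matrix dominated by its subdiagonal and the concatenation of attention inputs and outputs, can be sketched numerically. The snippet below is a minimal illustration under assumptions that do not come from the paper: scores are indexed as `[query position, key position]`, the subdiagonal strength `10.0` and the noise scale `0.1` are arbitrary, and `causal_softmax` is a hypothetical helper.

```python
import numpy as np


def causal_softmax(scores):
    """Row-wise softmax restricted to keys at or before each query position."""
    length = scores.shape[0]
    mask = np.tril(np.ones((length, length), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


seq_len, dim = 6, 4
rng = np.random.default_rng(0)

# Assumed score matrix: a strong subdiagonal plus small noise, indexed
# as [query position, key position].
scores = 10.0 * np.eye(seq_len, k=-1) + 0.1 * rng.standard_normal((seq_len, seq_len))
attn = causal_softmax(scores)

# For every query position after the first, the largest attention
# weight should fall on the immediately preceding position.
prev_positions = attn[1:].argmax(axis=1)

# Disentangled layer output: concatenate the input with the attention
# output instead of adding them, so the width doubles.
x = rng.standard_normal((seq_len, dim))
attended = attn @ x
disentangled = np.concatenate([x, attended], axis=1)
```

Concatenation keeps the original token and the attended content in separate coordinates, which is what lets each path through the network be inspected on its own.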

Theorems & Definitions (1)

  • Theorem 1