On the Emergence of Induction Heads for In-Context Learning
Tiberiu Musat, Tiago Pimentel, Lorenzo Noci, Alessandro Stolfo, Mrinmaya Sachan, Thomas Hofmann
TL;DR
This work gives a principled account of how in-context learning (ICL) arises in transformers by studying the emergence of induction heads during gradient-descent training on a minimal ICL task. Using a two-layer, attention-only transformer with a disentangled residual design, the authors prove that the training dynamics stay in a $19$-dimensional subspace of the parameter space, and they observe empirically that only $3$ of these dimensions (pseudo-parameters) account for the emergence of the induction head. Within this $3$-dimensional subspace, the time until the induction head emerges follows a tight asymptotic bound that is quadratic in the context length $L$, i.e. $\Theta(L^2)$. The authors also derive the structured weight matrices analytically and validate the theory empirically, yielding an interpretable mechanism for how ICL forms. The findings show how a simple, interpretable weight structure can implement an ICL mechanism, which may inform how the training of future LLMs is understood and guided.
Abstract
Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.
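To make the mechanism concrete, the following is a minimal, hand-wired sketch of what an induction head computes. It is an illustration, not the paper's trained model or its exact task formulation. The function name `induction_head_predict`, the one-hot token encoding, and the sharpness parameter `beta` are assumptions introduced here. The first step mimics a previous-token head, the second mimics the induction head that matches the current token against stored previous tokens and copies the token found there.

```python
import math


def one_hot(i, n):
    """Return a length-n one-hot list with a 1.0 at index i."""
    return [1.0 if j == i else 0.0 for j in range(n)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def induction_head_predict(tokens, vocab_size, beta=20.0):
    """Predict the token that followed the earlier occurrence of tokens[-1].

    Hypothetical illustration: weights are wired by hand rather than learned.
    """
    T = len(tokens)
    # Layer 1 (previous-token head): position t stores the identity of token t-1.
    # Position 0 has no predecessor, so it stores a zero vector.
    prev = [[0.0] * vocab_size] + [
        one_hot(tokens[t - 1], vocab_size) for t in range(1, T)
    ]
    # Layer 2 (induction head): the last position queries with its own token
    # and matches it against the previous-token slot of each earlier position.
    query = one_hot(tokens[-1], vocab_size)
    scores = [
        beta * sum(q * k for q, k in zip(query, prev[s])) for s in range(T - 1)
    ]
    attn = softmax(scores)
    # The value at each attended position is that position's own token identity.
    out = [0.0] * vocab_size
    for s, a in enumerate(attn):
        out[tokens[s]] += a
    return max(range(vocab_size), key=lambda v: out[v])


# The query token 3 appeared earlier and was followed by 7.
print(induction_head_predict([3, 7, 1, 4, 5, 2, 3], vocab_size=8))  # prints 7
```

Keeping the token identity and the previous-token identity in separate slots loosely mirrors the idea of a disentangled residual stream, though the correspondence to the architecture studied in the paper is only suggestive.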
