The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, Nikolaos Tsilivis
TL;DR
We present a controlled study of in-context learning (ICL) by training transformers on an in-context learning of Markov chains (ICL-MC) task, in which each sequence is generated from a Markov chain drawn from a Dirichlet prior. The main contribution is the discovery of statistical induction heads that compute next-token predictions from in-context bigram statistics, with learning unfolding in distinct phases: uniform predictions first, then unigram-based predictions, and finally Bayes-optimal bigram predictions. A minimal linear-transformer model corroborates the two-step gradient dynamics, showing that the second layer learns first and that unigram signals can hinder rapid formation of the bigram solution; an extension to $n$-grams shows that the hierarchical learning phenomenon generalizes. These results illuminate mechanistic pathways for ICL in LLMs and suggest how simple priors and curriculum-like shifts shape the emergence of complex in-context algorithms, with potential implications for understanding and improving in-context reasoning in real language models.
Abstract
Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov chain sequence modeling task to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form \emph{statistical induction heads} which compute accurate next-token probabilities given the bigram statistics of the context. During the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to predict sub-optimally using in-context single-token statistics (unigrams); then, there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution. We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to $n$-grams for $n > 2$.
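
To make the task concrete, the following is a minimal sketch of how one ICL-MC example could be generated, together with the two in-context strategies the abstract contrasts (unigram and bigram prediction). It is an illustration under stated assumptions rather than the authors' implementation: the function names, the symmetric Dirichlet concentration `alpha`, and the uniform initial state are all hypothetical choices made here.

```python
import numpy as np

def sample_icl_mc_sequence(k, length, rng, alpha=1.0):
    """Sample one sequence from a Markov chain whose rows are drawn from a Dirichlet prior.

    Assumptions (not taken from the source): symmetric Dirichlet(alpha) prior
    on each row, and a uniformly random initial state.
    """
    # P[i, j] = Pr(next token = j | current token = i); each row sums to 1.
    P = rng.dirichlet(alpha * np.ones(k), size=k)
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(k)
    for t in range(1, length):
        seq[t] = rng.choice(k, p=P[seq[t - 1]])
    return P, seq

def unigram_predictor(seq, k, alpha=1.0):
    """Predict the next token from smoothed in-context token frequencies."""
    counts = np.bincount(seq, minlength=k).astype(float)
    return (counts + alpha) / (counts.sum() + alpha * k)

def bigram_predictor(seq, k, alpha=1.0):
    """Predict the next token from smoothed in-context transition counts
    out of the last token (the Dirichlet posterior mean for that row)."""
    last = seq[-1]
    prev_tokens, next_tokens = seq[:-1], seq[1:]
    counts = np.bincount(next_tokens[prev_tokens == last], minlength=k).astype(float)
    return (counts + alpha) / (counts.sum() + alpha * k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, seq = sample_icl_mc_sequence(k=3, length=500, rng=rng)
    print("true row:      ", P[seq[-1]])
    print("bigram guess:  ", bigram_predictor(seq, 3))
    print("unigram guess: ", unigram_predictor(seq, 3))
```

The unigram predictor ignores which token came last, so it can only track overall token frequencies in the context, whereas the bigram predictor conditions on the last token and, with enough context, approaches the true transition row. This is the gap between the intermediate and final phases described above.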
