Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers
Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang
TL;DR
The paper proves that training a two-layer transformer with relative positional embeddings and a normalized FFN on $n$-gram Markov data converges, under gradient flow, in three stages to a generalized induction head (GIH). In the first stage, the FFN identifies an information set $\mathcal{S}^\star$ via a modified $\chi^2$-mutual information criterion; in the second, the first attention layer learns to copy the information from the selected parents; in the third, the classifier weight grows to realize exponential-kernel-style aggregation over matching histories. Together, the FFN, multi-head attention, and normalization yield a limiting model that performs ICL by kernel-like regression, generalizing the induction-head concept to multi-parent $n$-gram data. The work provides both a rigorous optimization-theoretic justification and empirical validation, and suggests how such a mechanism might extend to more complex transformer architectures. Overall, it offers a principled account of how ICL can emerge from the training dynamics of realistic transformer components, with implications for understanding and improving ICL in larger models.
Abstract
In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a $\mathit{copier}$, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a $\mathit{selector}$ that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a $\mathit{classifier}$ that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.
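The copier, selector, and classifier roles described above can be illustrated with a small hand-written sketch. It is not the trained transformer from the paper: the function name `gih_predict`, the fixed parent set `parents` (standing in for the learned information set), the sharpness parameter `a`, and the use of a simple match count as the similarity score are all assumptions made for illustration.

```python
import math


def gih_predict(tokens, parents, vocab_size, a=5.0):
    """Sketch of a generalized-induction-head prediction (illustrative only).

    tokens:     observed sequence of ints in [0, vocab_size)
    parents:    assumed set of positive lags treated as relevant (e.g. {1, 2})
    vocab_size: number of distinct tokens
    a:          assumed kernel sharpness; larger values favor exact matches
    Returns a probability distribution over the next token.
    """
    lags = sorted(parents)
    max_lag = lags[-1]
    n = len(tokens)
    if n <= max_lag:
        raise ValueError("sequence too short for the chosen parent lags")

    # "Copier" + "selector": the feature at a position is the tuple of
    # tokens found at the selected lags before that position.
    def feature(pos):
        return [tokens[pos - k] for k in lags]

    # Feature for the position to be predicted (index n, not yet observed).
    query = feature(n)

    # "Classifier": score each past position by how many selected parents
    # agree with the query, then aggregate with softmax weights.
    scores, targets = [], []
    for pos in range(max_lag, n):
        matches = sum(f == q for f, q in zip(feature(pos), query))
        scores.append(a * matches)
        targets.append(tokens[pos])

    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    probs = [0.0] * vocab_size
    for w, t in zip(weights, targets):
        probs[t] += w / total
    return probs


if __name__ == "__main__":
    seq = [0, 1, 2, 0, 1, 2, 0, 1]
    print(gih_predict(seq, parents={1, 2}, vocab_size=3))
```

On the periodic sequence in the usage example, the two past positions whose preceding tokens match the current history both point to token 2, so nearly all of the mass lands there when `a` is large. Setting `a=0` removes the kernel weighting and reduces the prediction to the empirical frequency of the tokens at the scored positions.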
