Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

TL;DR

The paper proves that training a two-attention-layer transformer on $n$-gram Markov data, with relative positional embeddings and a normalized FFN, converges under a three-stage gradient flow to a generalized induction head (GIH). In the first stage, the FFN identifies an information set ${\mathcal S}^\star$ via a modified $\chi^2$-mutual information criterion; in the second, the first attention layer learns to copy the information from the selected parents; in the final stage, the classifier weight grows to realize exponential-kernel-style aggregation over matching histories. Together, the FFN, multi-head attention, and normalization yield a limiting model that performs ICL by kernel-like regression, generalizing the induction-head concept to multi-parent $n$-gram data. The work provides both a rigorous optimization-theoretic justification and empirical validation, and suggests how such a mechanism might extend to more complex transformer architectures. It thereby offers a principled account of how ICL can emerge from the training dynamics of realistic transformer components, with implications for understanding and improving ICL in larger models.

Abstract

In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the induction head mechanism with a learned feature, resulting from the congruous contribution of all the building blocks. In the limiting model, the first attention layer acts as a $\mathit{copier}$, copying past tokens within a given window to each position, and the feed-forward network with normalization acts as a $\mathit{selector}$ that generates a feature vector by only looking at informationally relevant parents from the window. Finally, the second attention layer is a $\mathit{classifier}$ that compares these features with the feature at the output position, and uses the resulting similarity scores to generate the desired output. Our theory is further validated by experiments.
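
The copier, selector, and classifier roles described in the abstract can be illustrated with a minimal sketch. It is not the trained transformer: it assumes the limiting behavior in which a single information set dominates, scores each past position by whether its history matches the history at the output position on that set, and aggregates one-hot tokens with softmax weights scaled by a classifier weight `a`. The function names, the random Dirichlet transition kernel, and the all-or-nothing match score are illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np

def sample_ngram_chain(rng, d, parents, L):
    """Sample L tokens in {0..d-1}; each new token depends on the tokens at
    the negative offsets listed in `parents`, e.g. (-1, -2)."""
    n = max(-p for p in parents)
    # Hypothetical random transition kernel: one distribution per parent configuration.
    kernel = rng.dirichlet(np.ones(d), size=(d,) * len(parents))
    x = [int(t) for t in rng.integers(0, d, size=n)]
    while len(x) < L:
        context = tuple(x[p] for p in parents)
        x.append(int(rng.choice(d, p=kernel[context])))
    return np.array(x)

def gih_predict(x, d, info_set, a):
    """Kernel-like prediction of the token that follows x.

    info_set: positive lags, e.g. (1, 2) compares tokens at offsets -1 and -2.
    a: classifier weight; larger values concentrate on exactly matching histories.
    """
    x = np.asarray(x)
    L = len(x)
    m = max(info_set)
    # Copier + selector: history of the output position restricted to info_set.
    query = [x[L - s] for s in info_set]
    positions = np.arange(m, L)
    # Assumed similarity: 1 if the whole selected history matches, else 0.
    match = np.array(
        [all(x[l - s] == q for s, q in zip(info_set, query)) for l in positions],
        dtype=float,
    )
    # Classifier: softmax over past positions, then average their one-hot tokens.
    scores = a * match
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    values = np.eye(d)[x[positions]]
    return weights @ values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = sample_ngram_chain(rng, d=3, parents=(-1, -2), L=100)
    print(gih_predict(seq, d=3, info_set=(1, 2), a=5.0))
```

With `a = 0` the sketch reduces to the empirical token frequency over the context; as `a` grows it approaches the empirical next-token distribution conditioned on the matching history.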

Paper Structure

This paper contains 97 sections, 28 theorems, 362 equations, 9 figures, 1 table.

Key Result

Theorem 3.6

Suppose the initialization assumption (asp:initialization) and the Markov chain assumption (asp:Markov_chain) hold. Consider $H\ge M$. We set $\varepsilon = L^{-1/2}$ for the cross-entropy loss and assume $L$ is sufficiently large. Then the following holds for the three-stage training of gradient flow:

Figures (9)

  • Figure 2: A two-gram Markov chain with parent set ${\mathtt{pa}} = \{-1, -2\}$.
  • Figure 3: Illustration of the relationship between RPE vector $w^{(h)}$ and corresponding matrix $W_P^{(h)}$.
  • Figure 4: Illustration of the GIH mechanism in a two-attention-layer transformer model. Here, ${\mathtt{pa}}=\{-1, -2\}$, $M=3$ and ${\mathcal{S}}^\star=\{1, 2\}$. The first attention layer copies the parents (including the information set ${\mathcal{S}}^\star$) to the current position. Then the FFN layer together with layer normalization generates the features $u_l$ using the parent tokens within the information set ${\mathcal{S}}^\star$. The second attention layer treats each $x_l$ as the value, and aggregates the $x_l$ into the prediction by matching the keys and the query, both of which come from the learned features, using the attention mechanism. The $(L+1)$-th token is padded with zeros in the input.
  • Figure 5: Limiting model of $\mathtt{TF}(M=3,H=3, d=3, D=2)$ that implements the GIH mechanism with $L=100$, ${\mathtt{pa}}=\{-1, -2\}$. (a): The top left $10$ by $10$ block of $W_P^{(1)}$ that attends to the $-1$ parent. (b): The RPE weight heatmap for all 3 heads, where the $h$-th column corresponds to the RPE weight vector of head $h$. (c): In the GIH mechanism, only one $c_{{\mathcal{S}}^\star}$ for the optimal information set ${\mathcal{S}}^\star$ dominates. For the label of the $x$-axis, we use a binary coding $\{0, 1\}^3$ to indicate each subset ${\mathcal{S}}$. Here, ${\mathcal{S}}^\star = \{1,2\}$ is the parent set, which is represented by "110".
  • Figure 6: An illustration of the transformer parameters during the three-stage training. We train a transformer in $\mathtt{TF}(M=3,H=3, d=3, D=2)$ with $L=100$, ${\mathtt{pa}}=\{-1, -2\}$. See the experiments section (sec:experiments) for more details of the simulation. In (a) we show the evolution of $\{p_{\mathcal{S}}\}_{{\mathcal{S}} \in [H]_{\leq D} }$ in the first stage of training where $p_{{\mathcal{S}}} = c_{{\mathcal{S}}}^2/ \sum_{{\mathcal{S}}'\in[H]_{\leq D}} c_{{\mathcal{S}}'}^2$. We use a binary coding in $\{0, 1\}^3$ to indicate each subset ${\mathcal{S}}$. Recall that "110" represents ${\mathcal{S}} = \{1,2\}$, which is exactly ${\mathcal{S}}^\star$. This figure shows that $p_{{\mathcal{S}}^\star}$ gradually increases to one while every other $p_{{\mathcal{S}}}$ decays to zero. In (b) we plot the RPE weights of the first attention layer before and after the second stage of training. Here the $h$-th column corresponds to the RPE weight vector of head $h$. This figure shows that $w^{(1)} _{-1}$ and $w^{(2)} _{-2}$ increase to a large number after training, while $w^{(3)} _{-3}$ stays close to its initial value. Thus, we have $\sigma (w^{(1)}) \approx \sigma(w^{(2)}) \approx 1$. That is, the first two heads are trained to attend to parents $-1$ and $-2$, respectively. In (c) we plot the evolution of $a$ in the last stage of training. This figure clearly exhibits a two-step growth pattern, and $a$ keeps increasing throughout this stage. In summary, the simulation results agree with the theoretical results.
  • ...and 4 more figures
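
The quantities plotted in Figures 5 and 6 can be reproduced in a few lines. The sketch below enumerates the subsets in $[H]_{\leq D}$, maps each to the binary code used on the $x$-axis, normalizes squared coefficients into $p_{\mathcal{S}}$, and applies a softmax to an RPE weight vector. Including the empty set in $[H]_{\leq D}$, the sample coefficient values, and reading $\sigma$ as a softmax over the window offsets are assumptions made for illustration only.

```python
from itertools import combinations
import numpy as np

def subsets_up_to(H, D):
    """All subsets of {1, ..., H} with at most D elements.
    Assumption: the empty set is included."""
    return [frozenset(c) for k in range(D + 1)
            for c in combinations(range(1, H + 1), k)]

def binary_code(S, H):
    """Binary coding of a subset, e.g. {1, 2} with H = 3 becomes '110'."""
    return "".join("1" if h in S else "0" for h in range(1, H + 1))

def normalized_weights(c):
    """p_S = c_S^2 / sum over S' of c_{S'}^2, for a dict mapping subsets to c_S."""
    total = sum(v ** 2 for v in c.values())
    return {S: v ** 2 / total for S, v in c.items()}

def rpe_attention(w):
    """Softmax over the RPE weights of one head (assumed reading of sigma)."""
    w = np.asarray(w, dtype=float)
    e = np.exp(w - w.max())
    return e / e.sum()

if __name__ == "__main__":
    H, D = 3, 2
    subsets = subsets_up_to(H, D)
    # Hypothetical coefficients in which the set {1, 2} dominates.
    c = {S: (5.0 if S == frozenset({1, 2}) else 0.1) for S in subsets}
    for S, p in normalized_weights(c).items():
        print(binary_code(S, H), round(p, 4))
    # A head whose weight at offset -1 has grown large attends almost only there.
    print(rpe_attention([10.0, 0.0, 0.0]))
```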

Theorems & Definitions (34)

  • Definition 3.1: Modified $\chi^2$-Mutual Information
  • Definition 3.2: Generalized Induction Head
  • Definition 3.4: Primitive Matrix
  • Theorem 3.6: Convergence of Gradient Flow
  • Corollary 3.7
  • Lemma B.1
  • Definition B.2: Irreducible Matrix
  • Definition B.3: Primitive Matrix
  • Theorem B.4: Perron-Frobenius Theorem for Primitive Matrices
  • Lemma C.1
  • ...and 24 more
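
Two of the notions listed above have standard textbook forms that can be checked numerically. The sketch below computes the standard $\chi^2$-mutual information of a joint distribution, i.e. the $\chi^2$-divergence between the joint and the product of its marginals, and tests whether a nonnegative matrix is primitive by looking for an entrywise positive power within Wielandt's bound $(n-1)^2+1$. The paper's modified $\chi^2$-mutual information (Definition 3.1) is not stated in this summary and is not reproduced here; the function names are illustrative.

```python
import numpy as np

def chi2_mutual_information(joint):
    """Standard chi^2-mutual information of a joint probability table:
    sum over (x, y) of (p(x,y) - p(x)p(y))^2 / (p(x)p(y))."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    prod = px * py
    mask = prod > 0  # skip cells whose marginals have zero mass
    return float(np.sum((joint[mask] - prod[mask]) ** 2 / prod[mask]))

def is_primitive(A):
    """A nonnegative square matrix is primitive if some power is entrywise
    positive; by Wielandt's bound it suffices to check powers up to (n-1)^2 + 1."""
    pattern = (np.asarray(A) > 0).astype(int)
    n = pattern.shape[0]
    power = np.eye(n, dtype=int)
    for _ in range((n - 1) ** 2 + 1):
        # Track only the zero/nonzero pattern to avoid overflow.
        power = ((power @ pattern) > 0).astype(int)
        if power.all():
            return True
    return False

if __name__ == "__main__":
    independent = np.outer([0.5, 0.5], [0.2, 0.8])
    print(chi2_mutual_information(independent))
    print(chi2_mutual_information(np.eye(3) / 3))
    print(is_primitive([[0, 1], [1, 0]]), is_primitive([[0, 1], [1, 1]]))
```

Primitivity matters for Markov chain data because, by the Perron-Frobenius theorem for primitive matrices, a primitive transition matrix has a unique stationary distribution to which the chain converges.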