What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang

TL;DR

This work proves that a two-layer transformer with a single attention head per layer can represent any conditional k-gram model, thereby providing the tightest known depth characterization of transformer ICL for kth-order Markov processes. It also analyzes gradient-descent learning dynamics for first-order Markov chains under a two-stage training protocol that first learns the positional encodings and then sharpens attention to realize the induction head. The results emphasize a depth-width trade-off: depth can be reduced at the cost of increased width (or vice versa) while preserving the ability to model in-context distributions. Together, the findings deepen the theoretical understanding of ICL in compact transformer architectures and suggest practical avenues for more parameter-efficient sequence models. The work also clarifies the critical role of nonlinearities and layer normalization in enabling higher-order induction heads, with potential implications for designing efficient ICL-focused transformers.

Abstract

In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher-order Markov sources, the best known constructions require at least three layers (each with a single attention head), leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.
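
To make the target of the representation result concrete, here is a minimal sketch of a conditional k-gram estimator: it predicts the next symbol by counting what followed earlier occurrences of the last k symbols in the same sequence. The function name `conditional_kgram`, the integer encoding of symbols, and the uniform fallback when the context has not been seen before are assumptions made for illustration; they are not taken from the paper, which may define the estimator with different conventions (for example, smoothing).

```python
import numpy as np

def conditional_kgram(seq, k, vocab_size):
    """Empirical next-symbol distribution given the last k symbols of seq.

    Symbols are assumed to be integers in {0, ..., vocab_size - 1}.
    """
    seq = np.asarray(seq)
    context = seq[-k:]  # the k most recent symbols
    counts = np.zeros(vocab_size)
    # Visit every position i whose preceding k symbols match the context,
    # and count the symbol observed at position i.
    for i in range(k, len(seq)):
        if np.array_equal(seq[i - k:i], context):
            counts[seq[i]] += 1
    total = counts.sum()
    if total == 0:
        # Assumption: fall back to uniform when the context never recurred.
        return np.full(vocab_size, 1.0 / vocab_size)
    return counts / total

# Usage: an order-2 pattern 0, 1, 2, 0, 1, 2, ... ending in the context (0, 1).
probs = conditional_kgram([0, 1, 2, 0, 1, 2, 0, 1], k=2, vocab_size=3)
```

In this usage example the context (0, 1) was followed by 2 both times it appeared earlier, so the estimate places all mass on symbol 2. An induction head computes the same kind of statistic in-context by attending to positions whose preceding symbols match the current context.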

Paper Structure

This paper contains 78 sections, 11 theorems, 319 equations, 20 figures, 7 tables.

Key Result

Lemma 1

Let $\theta = \left(\{p_i^{(T_1)}\}_{i=0}^n, a_2^{(t)}\right)$, where $\{p_i^{(T_1)}\}_{i=0}^n$ denotes the output after the first stage of training. If $a_2$ satisfies $\exp(a_2) \leq \exp(a_2^*) := C_{\gamma, S} \, n^{1/12} \log^{-1/6} n$, and $a_{2, 0} > 0$ (at initialization), then

Figures (20)

  • Figure 1: Attention maps learnt by a two-layer, single-head transformer trained on sequences generated by random Markov chains of order $3$ (\ref{sec:markov_background}). (i) In the first layer, the attention map shows a clear pattern: attention weights increase monotonically along the first three lower diagonals and drop to zero beyond that. This suggests that the relative positional bias is maximized at the diagonal with an offset of $-k=-3$, i.e., the third diagonal below the main diagonal, which is consistent with the construction in \ref{sec:twolayersinglehead}. (ii) In the second layer, the attention map closely resembles the ideal attention pattern required to approximate the conditional $k$-gram estimator. We note that all experiments were conducted using standard initialization schemes. Additional experiments with different orders of Markov chains and experimental details are provided in \ref{app:experiments}. In \ref{subsec:onelayertransformers}, we also experimentally demonstrate that single-layer transformers fail to solve the induction head task with the same order of parameters. Finally, in \ref{sec:noisy-sequences}, we test the robustness of the two-layer, single-head model to noise in the input sequences.
  • Figure 2: The MLP Architecture
  • Figure 3: The MLP Architecture
  • Figure 4: Layer 1: Average attention map computed over multiple sequences.
  • Figure 5: Layer 2: Attention map corresponding to a randomly sampled sequence (denoted as sequence–1)
  • ...and 15 more figures

Theorems & Definitions (25)

  • proof : Proof sketch.
  • Remark 1
  • proof : Proof sketch.
  • Remark 2
  • proof : Proof sketch.
  • Lemma 1: Lemma D.8 in nichani2024how
  • Lemma 2: Extension of Lemma G.1 in nichani2024how to $k$th-order
  • proof
  • Lemma 3: Extension of Lemma D.3 in nichani2024how
  • proof
  • ...and 15 more