Transformers on Markov Data: Constant Depth Suffices

Nived Rajaraman, Marco Bondaschi, Kannan Ramchandran, Michael Gastpar, Ashok Vardhan Makkuva

TL;DR

The paper studies how transformers learn context in sequences drawn from a $k$-th order Markov process. It finds that constant-depth, single-head architectures can represent the in-context conditional empirical distribution; specifically, a $3$-layer, $1$-head transformer can represent the conditional $k$-gram model. Attention-only models can achieve the same with $O(\log k)$ layers by composing carefully constructed induction heads, while the non-linearity introduced by layer normalization is crucial for the efficient constant-depth, $3$-layer construction. The authors provide both constructive proofs (L2-norm attention realizing $k$-th order induction heads) and lower bounds (for $1$-layer and attention-only transformers, under reasonable assumptions), establishing depth as a key resource for scaling context beyond simple dependencies. These results clarify the mechanisms behind in-context learning and the architectural features that let transformers capture long-range dependencies.

Abstract

Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for $k^{\text{th}}$-order Markov sources, concurring with our empirical observations. Along the way, we prove that attention-only transformers with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. These results provide more insight into our current understanding of the mechanisms by which transformers learn to capture context, by understanding their behavior on Markov sources.
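
To make the data model in the abstract concrete, the following is a minimal sketch of sampling a sequence from a randomly drawn $k^{\text{th}}$-order Markov process. The function name `sample_markov_sequence`, the Dirichlet prior over transition distributions, and the uniform draw of the first $k$ symbols are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sample_markov_sequence(k, vocab_size, length, seed=0):
    """Sample a sequence from a randomly drawn k-th order Markov process (k >= 1).

    Hypothetical helper: the Dirichlet(1, ..., 1) prior and the uniform
    initial symbols are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    # One next-symbol distribution for every length-k context.
    # Shape: (vocab_size,) * k + (vocab_size,)
    kernel = rng.dirichlet(np.ones(vocab_size), size=(vocab_size,) * k)
    # The first k symbols have no full context, so draw them uniformly.
    seq = rng.integers(0, vocab_size, size=k).tolist()
    while len(seq) < length:
        context = tuple(seq[-k:])   # the previous k symbols
        probs = kernel[context]     # P(. | previous k symbols)
        seq.append(int(rng.choice(vocab_size, p=probs)))
    return np.array(seq[:length]), kernel

seq, kernel = sample_markov_sequence(k=3, vocab_size=2, length=50, seed=1)
```

Each next symbol depends only on the preceding `k` entries of `seq`, which is the defining property of the process described above.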

Paper Structure (40 sections, 11 theorems, 121 equations, 10 figures, 3 tables)

Key Result

Theorem 4.1

The conditional $1$-gram model can be represented by a $2$-layer and $1$-head attention-only transformer with embedding dimension $d = 3S+2$.
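
For intuition about the object in Theorem 4.1, one plausible way to write the conditional $1$-gram estimate for a sequence $x_1, \dots, x_T$ is sketched below. This page does not reproduce the paper's Definition 2.1, so the exact form (for example, any smoothing or the handling of an empty denominator) is an assumption here.

```latex
$$
\hat{P}(x \mid x_1, \dots, x_T)
  = \frac{\sum_{n=2}^{T} \mathbb{1}\{x_{n-1} = x_T,\; x_n = x\}}
         {\sum_{n=2}^{T} \mathbb{1}\{x_{n-1} = x_T\}}
$$
```

As a worked example under this assumed form, take the binary sequence $0, 1, 0, 0, 1, 0$, so $x_T = 0$. The positions $n$ with $x_{n-1} = 0$ are $n = 2, 4, 5$, and the symbols at those positions are $1, 0, 1$. The estimate is therefore $\hat{P}(0 \mid \cdot) = 1/3$ and $\hat{P}(1 \mid \cdot) = 2/3$.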

Figures (10)

  • Figure 1: $k^{\text{th}}$-order Markov processes for $k=4$. The next symbol $X_{n+1}$ in the sequence is sampled from the distribution $P(\cdot | X_n, X_{n-1}, X_{n-2}, X_{n-3})$ which only depends on the last $k (=4)$ symbols (marked in red).
  • Figure 2: Conditional $k$-gram model. The conditional $k$-gram is the in-context estimate of the Markov process and is realized in two steps. The first step is to find the locations in the sequence (marked red) which match the final $k$ symbols (functionally, a $k^{\text{th}}$-order induction head). The conditional $k$-gram model returns the uniform distribution over the next symbol at these locations (marked blue).
  • Figure 3: Transformer architecture. POS refers to the relative position encodings.
  • Figure 4: Gap with the optimal test loss for $(a)$ a $2$-layer, $1$-head transformer model (above), and $(b)$ a $3$-layer, $1$-head transformer (below), averaged over $3$ runs for each $k$. The models learn the conditional $k$-gram model for randomly sampled $k$-th order Markov processes, even for large $k$.
  • Figure 5: $k^{\text{th}}$-order induction head for $k=2$. The attention pattern $\operatorname{att}_{T,n}$ is maximized for those values of $n$ at which $x_{T-j+1} = x_{n-j}$ for all $j \in [k]$. These are the positions whose preceding $k$ symbols match the last $k$ symbols in the sequence.
  • ...and 5 more figures
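
The two steps described in the captions of Figures 2 and 5 (find the positions whose preceding $k$ symbols match the final $k$ symbols, then take the distribution of the symbols at those positions) can be sketched in plain Python. This illustrates the target function only, not the transformer construction from the paper. The names `match_positions` and `conditional_kgram` and the uniform fallback when no position matches are assumptions.

```python
from collections import Counter

def match_positions(seq, k):
    """0-indexed positions n whose preceding k symbols equal the last k symbols.

    Mirrors the condition x_{T-j+1} = x_{n-j} for all j in [k] from the
    Figure 5 caption, shifted to 0-based indexing.
    """
    seq = list(seq)
    T = len(seq)
    suffix = seq[T - k:]
    return [n for n in range(k, T) if seq[n - k:n] == suffix]

def conditional_kgram(seq, k, vocab_size):
    """Empirical distribution of the symbols found at the matched positions."""
    seq = list(seq)
    positions = match_positions(seq, k)
    if not positions:
        # Assumption: fall back to uniform when the suffix never occurred before.
        return [1.0 / vocab_size] * vocab_size
    counts = Counter(seq[n] for n in positions)
    return [counts[s] / len(positions) for s in range(vocab_size)]

example = [0, 1, 1, 0, 1, 0, 0, 1]
positions = match_positions(example, k=2)               # suffix is [0, 1]
estimate = conditional_kgram(example, k=2, vocab_size=2)
```

For `example`, the suffix `[0, 1]` also occurs at indices 0-1 and 3-4, so the matched positions are 2 and 5. The symbols there are 1 and 0, giving the estimate `[0.5, 0.5]`.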

Theorems & Definitions (18)

  • Definition 2.1: Conditional $k$-gram model
  • Theorem 4.1
  • Remark 4.2
  • Definition 4.3: Higher-order induction head
  • Theorem 4.4
  • Theorem 4.5
  • Theorem 5.1
  • Remark 5.2
  • Theorem 6.1
  • Theorem 6.3
  • ...and 8 more