Transformers on Markov Data: Constant Depth Suffices

Nived Rajaraman; Marco Bondaschi; Kannan Ramchandran; Michael Gastpar; Ashok Vardhan Makkuva

TL;DR

The paper studies how transformers learn context in sequences drawn from a $k$-th order Markov process. It finds that a constant-depth transformer with one head per layer can represent the in-context conditional empirical distribution (the conditional $k$-gram model); specifically, $3$ layers and $1$ head suffice. Attention-only models achieve the same with $O(\log k)$ layers by composing carefully constructed induction heads, while the non-linearity introduced by layer normalization is what makes the constant-depth, $3$-layer construction possible. The authors give constructive proofs ($L_2$-norm attention realizing $k$-th order induction heads) and lower bounds (for $1$-layer transformers, and conditional bounds for attention-only transformers), establishing depth as a key resource for scaling context beyond simple dependencies. These results clarify the mechanisms behind in-context learning and the architectural features that let transformers capture long-range dependencies.

Abstract

Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for $k^{\text{th}}$-order Markov sources, concurring with our empirical observations. Along the way, we prove that attention-only transformers with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. These results provide more insight into our current understanding of the mechanisms by which transformers learn to capture context, by understanding their behavior on Markov sources.
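
The in-context conditional empirical distribution mentioned in the abstract can be illustrated with a short Python sketch. It is only a direct counting procedure, not the paper's transformer construction: it finds the earlier positions whose preceding $k$ symbols match the final $k$ symbols of the sequence, then returns the empirical distribution of the symbols at those positions. The function name `conditional_k_gram`, the integer symbol encoding `0..vocab_size-1`, and the choice to return `None` when the suffix never occurred earlier are assumptions made for illustration.

```python
from collections import Counter


def conditional_k_gram(seq, k, vocab_size):
    """Empirical next-symbol distribution given the last k symbols of seq.

    Symbols are assumed to be integers in range(vocab_size). Returns None
    when no estimate exists (an illustrative fallback, not from the paper).
    """
    T = len(seq)
    if k < 1 or T <= k:
        return None
    suffix = tuple(seq[T - k:])  # the final k symbols of the context
    counts = Counter()
    # n is the 0-indexed position of a candidate "next symbol"; the k symbols
    # immediately before it must equal the final k symbols of the sequence.
    for n in range(k, T):
        if tuple(seq[n - k:n]) == suffix:
            counts[seq[n]] += 1
    total = sum(counts.values())
    if total == 0:
        return None  # the suffix never appeared earlier in the context
    return [counts[s] / total for s in range(vocab_size)]


if __name__ == "__main__":
    context = [0, 1, 0, 1, 1, 0, 1]
    print(conditional_k_gram(context, k=1, vocab_size=2))
    print(conditional_k_gram(context, k=2, vocab_size=2))  # [0.5, 0.5]
```

For the binary context above with $k=2$, the suffix `(0, 1)` occurs twice earlier in the sequence, followed once by `0` and once by `1`, so the estimate is uniform.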

Paper Structure (40 sections, 11 theorems, 121 equations, 10 figures, 3 tables)

Introduction
Notation.
Related work
Preliminaries
Markov processes
Transformer architecture
Understanding the empirical behavior of transformers
Warming up: Attention-only transformers
Understanding the role of non-linearity: Constant-depth constructions
Modification to the standard transformer architecture.
Proof sketch
Realizing $L_2$-norm attention (eq. \ref{eq:L2att}).
Lower bounds on transformer size
Conditional lower bounds on attention-only transformers
Conclusion
...and 25 more sections

Key Result

Theorem 4.1

The conditional $1$-gram model can be represented by a $2$-layer and $1$-head attention-only transformer with embedding dimension $d = 3S+2$.

Figures (10)

Figure 1: $k^{\text{th}}$-order Markov processes for $k=4$. The next symbol $X_{n+1}$ in the sequence is sampled from the distribution $P(\cdot | X_n, X_{n-1}, X_{n-2}, X_{n-3})$ which only depends on the last $k (=4)$ symbols (marked in red).
Figure 2: Conditional $k$-gram model. The conditional $k$-gram is the in-context estimate of the Markov process and is realized in two steps. The first step is to find the locations in the sequence (marked red) which match the final $k$ symbols (functionally, a $k^{\text{th}}$-order induction head). The conditional $k$-gram model returns the uniform distribution over the next symbol at these locations (marked blue).
Figure 3: Transformer architecture. POS refers to the relative position encodings.
Figure 4: Gap with the optimal test loss for $(a)$ a $2$-layer, $1$-head transformer model (above), and $(b)$ a $3$-layer, $1$-head transformer (below), averaged over $3$ runs for each $k$. The models learn the conditional $k$-gram model for randomly sampled $k$-th order Markov processes, even for large $k$.
Figure 5: $k^{\text{th}}$-order induction head for $k=2$. The attention pattern $\operatorname{att}_{T,n}$ is maximized for those values of $n$ at which $x_{T-j+1} = x_{n-j}$ for all $j \in [k]$. These are the positions whose preceding $k$ symbols match the last $k$ symbols in the sequence.
...and 5 more figures

Theorems & Definitions (18)

Definition 2.1: Conditional $k$-gram model
Theorem 4.1
Remark 4.2
Definition 4.3: Higher-order induction head
Theorem 4.4
Theorem 4.5
Theorem 5.1
Remark 5.2
Theorem 6.1
Theorem 6.3
...and 8 more
