Large Language Models as Markov Chains

Oussama Zekri; Ambroise Odonnat; Abdelhakim Benechehab; Linus Bleistein; Nicolas Boullé; Ievgen Redko

TL;DR

The work reframes autoregressive transformer-based LLMs as finite-state Markov chains, enabling principled analysis of inference, pre-training, and in-context learning under realistic, non-iid data conditions. It derives non-iid pre-training sample complexity bounds via Marton couplings and establishes ICL generalization bounds that depend on Markov-chain mixing properties, validating them with modern LLMs (Llama and Gemma families). The theory is supported by numerical experiments showing that recent models approximate the predicted scaling laws and demonstrate coherent behavior under context-driven generation. This Markov-chain perspective provides a concrete mechanism to understand LLM generalization, repetition, and coherence through the stationary distribution and mixing dynamics of the associated transition structure.

Abstract

Large language models (LLMs) are remarkably efficient across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of LLMs' generalization capabilities remains elusive. In our paper, we approach this task by drawing an equivalence between autoregressive transformer-based language models and Markov chains defined on a finite state space. This allows us to study the multi-step inference mechanism of LLMs from first principles. We relate the obtained results to pathological behaviors observed in LLMs, such as repetitions and incoherent replies at high temperature. Finally, we leverage the proposed formalization to derive pre-training and in-context learning generalization bounds for LLMs under realistic data and model assumptions. Experiments with the most recent Llama and Gemma herds of models show that our theory correctly captures their behavior in practice.
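
The equivalence described in the abstract can be made concrete on a tiny example. The sketch below is illustrative only and is not the authors' code. It assumes the states are all token sequences of length 1 to $K$ over a vocabulary of size $T$, and that a transition appends the sampled token, dropping the oldest token once the context window is full. The function `toy_next_token_probs` is a hypothetical stand-in for a real model's next-token distribution.

```python
import itertools
import numpy as np

def build_states(T, K):
    """All token sequences of length 1..K over the vocabulary {0, ..., T-1}."""
    return [seq for n in range(1, K + 1)
            for seq in itertools.product(range(T), repeat=n)]

def toy_next_token_probs(seq, T, temperature=1.0):
    """Hypothetical stand-in for an LLM: deterministic logits, then softmax."""
    offset = sum((i + 1) * tok for i, tok in enumerate(seq))
    logits = np.array([np.sin(1.0 + tok + offset) for tok in range(T)])
    z = logits / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def build_transition_matrix(T, K, temperature=1.0):
    """Transition matrix over all sequences of length 1..K (assumed construction)."""
    states = build_states(T, K)
    index = {s: i for i, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for s in states:
        p = toy_next_token_probs(s, T, temperature)
        for tok in range(T):
            # Append the new token; drop the oldest one if the window is full.
            nxt = s + (tok,) if len(s) < K else s[1:] + (tok,)
            Q[index[s], index[nxt]] += p[tok]
    return states, Q

states, Q = build_transition_matrix(T=2, K=3)
print(len(states))  # number of states: T + T^2 + ... + T^K
```

With $T=2$ and $K=3$ there are $2 + 4 + 8 = 14$ states, and each row of the matrix has at most $T$ nonzero entries, which is where the sparsity comes from in this construction.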

Paper Structure (72 sections, 208 equations, 18 figures, 2 tables)

Introduction
Background Knowledge
Large Language Models as Markov Chains
Markov Chain Formalization
Illustration on a Toy Model
Sample Complexity and Generalization
Main Result: Pre-training Sample Complexity
Non-iid data.
Pre-Training Generalization Bound
In-Context Learning of Markov Chains
Numerical Experiments
Conclusion
Roadmap.
Notations
Background on Large Language Models
...and 57 more sections

Figures (18)

Figure 1: LLMs' Sample Complexity. We plot the Massive Multitask Language Understanding (MMLU) performance (Hendrycks et al., 2021) with respect to the approximation error $\epsilon$ predicted by \ref{cor:sample_complexity}. We set $N^*$ equal to the real number of pre-training tokens. Each point represents a model from the Llama or Gemma families (Gemma Team, 2024; Dubey et al., 2024). The approximation error $\epsilon$ predicted by our theory correlates with the real performance, with different trends between the models' families.
Figure 2: LLM as a Markov chain. A large language model with vocabulary size $T$ and context window $K$ is equivalent to a Markov chain with a sparse and block-structured transition matrix of size ${\sum_{i\leq K} T^i \sim \mathcal{O}(T^{K})}$. The latter captures all possible outputs of a given LLM for all possible input sequences allowed by its vocabulary and context window.
Figure 3: \ref{prop:LLM_formal_def} with $T=2$ and $K=3$.
Figure 4: Markov chain with a small GPT-like model. (a) Transition matrix $\mathbf{Q}_f$ of the model where $\textcolor{red}{\square}$ denotes the examples from the training set. (b) The stationary distribution of the trained model assigns almost uniform probabilities to the states seen during training. (c) Convergence rate to the stationary distribution for the considered toy model along with three LLMs, highlighting the dependence on $K$. The y-axis is the upper bound in \ref{prop:stationary_distrib}.
Figure 5: Dependence of $\varepsilon$ on the temperature of the model. (a) For low temperatures, $\varepsilon$ becomes too small to achieve convergence to the stationary distribution. (b)-(c) Increasing the temperature from $1$ to $2$ leads to a $10\times$ faster convergence. (d) $\varepsilon$ (log-scale) increases with the temperature over the range $[0.1, 2]$.
...and 13 more figures
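
The captions for Figures 4 and 5 refer to a stationary distribution, a convergence rate, and a temperature-dependent quantity $\varepsilon$. The sketch below illustrates these notions on a small dense chain. It is not the paper's experiment. The logits are random, and $\varepsilon$ is taken to be the smallest transition probability, which is an assumption made for illustration and may differ from the paper's definition.

```python
import numpy as np

def softmax_rows(logits, temperature):
    """Row-wise softmax of logits / temperature, giving a row-stochastic matrix."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def stationary_distribution(Q, tol=1e-12, max_iter=100_000):
    """Power iteration: repeat pi <- pi Q, starting from the uniform distribution."""
    pi = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(max_iter):
        nxt = pi @ Q
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

def steps_to_mix(Q, pi, threshold=1e-3, max_steps=10_000):
    """First t at which the worst-case total variation distance to pi is below threshold."""
    P = np.eye(Q.shape[0])
    for t in range(1, max_steps + 1):
        P = P @ Q
        tv = 0.5 * np.abs(P - pi).sum(axis=1).max()
        if tv < threshold:
            return t
    return None  # did not mix within max_steps

# Hypothetical logits for a 6-state chain (illustration only).
logits = np.random.default_rng(0).normal(size=(6, 6))

for temperature in (0.5, 1.0, 2.0):
    Q = softmax_rows(logits, temperature)
    pi = stationary_distribution(Q)
    eps = Q.min()  # assumed proxy for epsilon: smallest transition probability
    print(temperature, eps, steps_to_mix(Q, pi))
```

In this toy setting, raising the temperature flattens every row of the matrix, so the smallest transition probability can only grow. That matches the qualitative trend the Figure 5 caption reports for $\varepsilon$.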

Theorems & Definitions (32)

Remark 4.1: Choice of risk
Remark 3.1: Well-posedness of $t_\mathrm{min}$
Proof of \ref{prop:properties_K}
Proof
Proof
Proof
Proof
Proof of \ref{prop:LLM_formal_def}
Proof
Proof of \ref{prop:ergodic_unichains}
...and 22 more
