Large Language Models as Markov Chains
Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko
TL;DR
The work reframes autoregressive transformer-based LLMs as finite-state Markov chains, enabling principled analysis of inference, pre-training, and in-context learning (ICL) under realistic, non-iid data conditions. It derives non-iid pre-training sample complexity bounds via Marton couplings and establishes ICL generalization bounds that depend on the mixing properties of the Markov chain, validating them with modern LLMs (Llama and Gemma families). Numerical experiments support the theory, showing that recent models approximate the predicted scaling laws and behave coherently under context-driven generation. This Markov-chain perspective provides a concrete mechanism for understanding LLM generalization, repetition, and coherence through the stationary distribution and mixing dynamics of the associated transition structure.
Abstract
Large language models (LLMs) are remarkably efficient across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of LLMs' generalization capabilities remains elusive. In this paper, we approach this task by drawing an equivalence between autoregressive transformer-based language models and Markov chains defined on a finite state space. This allows us to study the multi-step inference mechanism of LLMs from first principles. We relate the obtained results to pathological behaviors observed with LLMs, such as repetitions and incoherent replies at high temperature. Finally, we leverage the proposed formalization to derive pre-training and in-context learning generalization bounds for LLMs under realistic data and model assumptions. Experiments with the most recent Llama and Gemma herds of models show that our theory correctly captures their behavior in practice.
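
To make the equivalence concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a state is a token sequence of length at most K (the context window) and that a transition appends one sampled token, dropping the oldest token once the window is full. The function `toy_next_token_probs` is a hypothetical stand-in for a real LLM's next-token distribution, and all other names are illustrative.

```python
import itertools
import numpy as np


def build_states(vocab_size, context_window):
    """Enumerate every non-empty token sequence of length <= context_window."""
    states = []
    for length in range(1, context_window + 1):
        states.extend(itertools.product(range(vocab_size), repeat=length))
    return states


def toy_next_token_probs(state, vocab_size, temperature=1.0):
    """Hypothetical stand-in for an LLM: a softmax over logits that are a
    fixed deterministic function of the context (assumption, not a real model)."""
    context_score = sum((i + 1) * tok for i, tok in enumerate(state))
    logits = np.array([np.sin(1.0 + tok + context_score) for tok in range(vocab_size)])
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


def transition_matrix(vocab_size, context_window, temperature=1.0):
    """Build the row-stochastic matrix Q of the chain induced by the toy model."""
    states = build_states(vocab_size, context_window)
    index = {s: i for i, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for s in states:
        probs = toy_next_token_probs(s, vocab_size, temperature)
        for tok, p in enumerate(probs):
            nxt = s + (tok,)
            if len(nxt) > context_window:
                nxt = nxt[1:]  # window is full: drop the oldest token
            Q[index[s], index[nxt]] += p
    return states, Q


def stationary_distribution(Q, n_steps=1000):
    """Approximate the stationary distribution by repeated multiplication."""
    pi = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(n_steps):
        pi = pi @ Q
    return pi


states, Q = transition_matrix(vocab_size=2, context_window=2)
pi = stationary_distribution(Q)
for s, mass in zip(states, pi):
    print(s, round(float(mass), 4))
```

In this toy setting, sequences shorter than K are transient, so the stationary distribution places all of its mass on full-length contexts. Varying `temperature` changes how quickly repeated multiplication by `Q` settles, which gives a small-scale intuition for the role that mixing plays in the analysis.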
