From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak
TL;DR
This work establishes a rigorous link between one-layer self-attention and Context-Conditioned Markov Chains (CCMC), enabling a tractable, convex maximum-likelihood view of attention dynamics under the weight-tying constraint $C E^\top = I_K$. It proves identifiability and consistent learning from prompts via connectivity-coverage conditions, and provides finite-sample guarantees with rates scaling as $K^2/n$, clarifying how prompt distributions affect learnability. The paper also analyzes learning from a single autoregressive trajectory, revealing distribution-collapse phenomena that offer a mathematical explanation for repetition in language models, and extends the theory to position-aware self-attention through positional encoding. Together, these results yield a simple yet powerful framework to study self-attention, its optimization landscape, and its generative properties across prompting and trajectory regimes.
Abstract
Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: given a prompt, the model samples the output token according to a context-conditioned Markov chain (CCMC), which reweights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.
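The mapping between self-attention and a CCMC can be illustrated with a minimal sketch. It makes simplifying assumptions that are not taken from the paper's exact notation: tokens are integers in {0, ..., K-1} with one-hot embeddings (the simplest case in which the output layer inverts the embedding), the attention score of a token `k` against the last token `last` is read from a hypothetical K x K table `W[k][last]`, and positional encoding is ignored. Under these assumptions, the next-token distribution computed by attention should coincide with a base Markov chain row reweighted by how often each token occurs in the prompt. All function names are illustrative.

```python
import math
import random
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_next_token_dist(prompt, W):
    """Next-token distribution from 1-layer self-attention.

    Assumption: one-hot embeddings, so the score of position i against
    the last token is W[prompt[i]][last], and the attended output is a
    probability vector over the K tokens.
    """
    last = prompt[-1]
    attn = softmax([W[tok][last] for tok in prompt])  # one weight per position
    dist = [0.0] * len(W)
    for tok, a in zip(prompt, attn):
        dist[tok] += a  # positions holding the same token pool their weight
    return dist


def ccmc_next_token_dist(prompt, W):
    """Same distribution, written as a context-conditioned Markov chain."""
    K = len(W)
    last = prompt[-1]
    base_row = softmax([W[k][last] for k in range(K)])  # base chain row for `last`
    counts = Counter(prompt)  # token frequencies in the prompt
    weighted = [base_row[k] * counts[k] for k in range(K)]
    total = sum(weighted)
    return [w / total for w in weighted]


def generate(prompt, W, steps, rng):
    """Autoregressively extend the prompt by sampling from the CCMC."""
    seq = list(prompt)
    for _ in range(steps):
        dist = ccmc_next_token_dist(seq, W)
        seq.append(rng.choices(range(len(W)), weights=dist)[0])
    return seq
```

In this sketch, a token that never appears in the prompt has count zero and therefore can never be generated, while each sampled token raises its own count for all later steps. This self-reinforcement gives some intuition for the winner-takes-all collapse described in the abstract, although the sketch does not reproduce the paper's analysis.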
