From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak

TL;DR

This work establishes a rigorous link between one-layer self-attention and Context-Conditioned Markov Chains (CCMC), enabling a tractable, convex maximum-likelihood view of attention dynamics under the weight-tying constraint $C E^T = I_K$. It proves identifiability and consistent learning from prompts via connectivity-coverage conditions, and provides finite-sample guarantees with rates scaling as ${K^2/n}$, clarifying how prompt distributions affect learnability. The paper also analyzes learning from a single autoregressive trajectory, revealing distribution-collapse phenomena that offer a mathematical explanation for repetition in language models, and extends the theory to position-aware self-attention through positional encoding. Together, these results yield a simple yet powerful framework to study self-attention, its optimization landscape, and its generative properties across prompting and trajectory regimes.

Abstract

Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: given a prompt, the model samples the output token according to a context-conditioned Markov chain (CCMC), which reweights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.
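The mapping described in the abstract can be checked numerically on a toy example. The sketch below is not the authors' code: it assumes the last prompt token acts as the query, that the weight-tying constraint $C E^T = I_K$ lets the attention mass on each position be read off as probability mass on that position's token, and that the base chain's transition probabilities out of token $i$ are $\mathbb{S}({\bm{E}}\bm{W}\bm{e}_i)$. The function names, the random embeddings, and the 0-indexed tokens (so the prompt $[1, 2, 1]$ becomes `[0, 1, 0]`) are illustrative choices.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_scores(E, W, last_token):
    """Score e_k^T W e_last for every vocabulary token k."""
    d = len(W)
    q = E[last_token]
    Wq = [sum(W[r][c] * q[c] for c in range(d)) for r in range(d)]
    return [sum(e[r] * Wq[r] for r in range(d)) for e in E]


def attention_probs(prompt, E, W):
    """Self-attention view: softmax over prompt positions, then sum the
    attention mass per token (this read-out assumes C E^T = I_K)."""
    scores = query_scores(E, W, prompt[-1])
    attn = softmax([scores[tok] for tok in prompt])
    probs = [0.0] * len(E)
    for tok, a in zip(prompt, attn):
        probs[tok] += a
    return probs


def ccmc_probs(prompt, E, W):
    """CCMC view: reweight the base transition probabilities of the last
    token by how often each token occurs in the prompt, then renormalize."""
    base = softmax(query_scores(E, W, prompt[-1]))  # base chain, last token
    counts = [prompt.count(k) for k in range(len(E))]  # m(X), unnormalized
    weighted = [m * p for m, p in zip(counts, base)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Toy setup (hypothetical values): K = 3 tokens, embedding dimension d = 4.
rng = random.Random(0)
K, d = 3, 4
E = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]

prompt = [0, 1, 0]
p_attn = attention_probs(prompt, E, W)
p_ccmc = ccmc_probs(prompt, E, W)
print("attention:", p_attn)
print("CCMC:     ", p_ccmc)
```

Both functions should print the same distribution up to floating-point error, and a token absent from the prompt (here token `2`) receives zero probability, since the prompt's frequency vector masks it out.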

Paper Structure (29 sections, 28 theorems, 94 equations, 8 figures)

Key Result

Lemma 2.2

Let $(X, y)$ be an arbitrary pair of (prompt, next token). Define $\boldsymbol{\pi}^X \in \mathbb{R}^K$ based on ${\bm{P}}^{\bm{W}}$ using Definition 2.1. We have that

Figures (8)

  • Figure 1: Demonstration of Definition 2.1. We provide an example where the vocabulary size $K = 3$ and the input prompt $X = [1, 2, 1]$, which results in a frequency vector $\bm{m}(X)$. ${\bm{P}}$ represents the transition matrix of the base Markov chain.
  • Figure 2: Illustration of the equivalence between the attention and CCMC models. We provide an example where the vocabulary size $K=3$ and the input prompt is $X = [1, 2, 1]$. The upper figure shows how the token probabilities $\mathbb{S}({\bm{E}}\bm{W}\bm{e}_i)$ can be mapped to a base transition matrix ${\bm{P}}$. The lower-left figure shows the output of the self-attention given an input prompt ${\bm{X}}$. The lower-right figure derives CCMC transitions from this ${\bm{P}}$ given the same prompt. The resulting next-token probabilities are the same for both models. The masking operation is shown in more detail in Figure \ref{fig:main_figure2}.
  • Figure 3: Illustration of co-occurrence graphs for the self-attention (Left) and cross-attention (Right) models. We fit the same input examples where the only difference is the use of self- vs cross-attention (which includes vs excludes the query token '$1$' from the list of key tokens). Following Theorem \ref{theorem consistency}, in the cross-attention setting, where the token '$1$' is not contained in the prompt, the co-occurrence graph becomes disconnected, resulting in inconsistent estimation. In contrast, the estimation is consistent for the self-attention model since both inputs share the same query token within their key tokens.
  • Figure 4: Left: Illustration of finite sample learning where the next tokens are sampled from the ground-truth model, which corresponds to single outputs from multiple IID trajectories. Right: In practice, the scenario is analogous to querying language models with prompts on different topics and using the responses to train a tiny model. In Theorem \ref{theorem consistency}, we characterize the conditions under which the tiny model can estimate the ground-truth model consistently.
  • Figure 5: Demonstration of Distribution Collapse/Repetition. Left: An example query where the GPT-2 response quickly degenerates into repetition. Middle: We generate two single self-attention trajectories with a vocabulary containing $K=6$ tokens and plot the empirical token frequencies (for each token in the vocabulary). The upper figure is generated using a randomly initialized transition matrix, while the lower one is generated using the same transition matrix as the upper one except that the diagonal entries are set to $0$, enforcing that the probability of query token $i \to \text{ next token }i$ is $0$. The frequency is calculated as the ratio of token occurrences to the sequence length at that time. Right: Trajectory snapshots with a 10-token window from time index $i$ revealing that token $5$ (upper) / tokens $2 \text{ and } 5$ (lower) dominate the trajectory. The lower right is dominated by two tokens because a single token cannot self-reinforce due to zero diagonals.
  • ...and 3 more figures
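
The single-trajectory experiment described in the Figure 5 caption can be imitated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' experiment: it samples directly from the CCMC view (next-token probability proportional to the token's count in the sequence so far times the base transition probability out of the last token), the way the random transition matrix is drawn and the helper names are hypothetical, and tokens are 0-indexed. The second run reuses the same matrix with its diagonal set to $0$, as in the caption.

```python
import random


def random_transition_matrix(K, rng):
    """Row-stochastic K x K matrix with strictly positive random entries."""
    P = []
    for _ in range(K):
        row = [rng.random() + 1e-3 for _ in range(K)]
        total = sum(row)
        P.append([v / total for v in row])
    return P


def with_zero_diagonal(P):
    """Copy of P with P[i][i] = 0. Rows are left unnormalized because the
    CCMC sampling step renormalizes anyway."""
    return [[0.0 if i == j else v for j, v in enumerate(row)]
            for i, row in enumerate(P)]


def sample_trajectory(P, prompt, steps, rng):
    """Autoregressive CCMC sampling: each new token is appended to the
    context, so its count influences all later steps."""
    K = len(P)
    seq = list(prompt)
    counts = [seq.count(k) for k in range(K)]
    for _ in range(steps):
        last = seq[-1]
        weights = [counts[k] * P[last][k] for k in range(K)]
        nxt = rng.choices(range(K), weights=weights)[0]
        seq.append(nxt)
        counts[nxt] += 1
    return seq


def token_frequencies(seq, K):
    """Empirical frequency of each token over the whole sequence."""
    return [seq.count(k) / len(seq) for k in range(K)]


K, steps = 6, 2000
start = list(range(K))  # initial prompt containing every token once
P = random_transition_matrix(K, random.Random(1))
P_zero = with_zero_diagonal(P)

traj_random = sample_trajectory(P, start, steps, random.Random(2))
traj_zero = sample_trajectory(P_zero, start, steps, random.Random(2))

print("random P:     ", [round(f, 3) for f in token_frequencies(traj_random, K)])
print("zero diagonal:", [round(f, 3) for f in token_frequencies(traj_zero, K)])
```

Which tokens end up dominating depends on the seed and the sampled matrix, so no particular output is claimed here. With the zero-diagonal matrix a token can never follow itself, which is why the caption reports two tokens dominating rather than one.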

Theorems & Definitions (37)

  • Definition 2.1
  • Lemma 2.2
  • Definition 2.4
  • Lemma 2.5
  • Theorem 2.6
  • Definition 3.1
  • Definition 3.3
  • Theorem 3.4
  • Corollary 3.5
  • Lemma 3.6
  • ...and 27 more