A Markov Categorical Framework for Language Modeling

Yifan Zhang

TL;DR

The paper introduces a principled framework that treats autoregressive language-model generation as a composition of Markov kernels within the Markov category $\texttt{Stoch}$, unifying the training objective, learning geometry, and model capabilities. It shows that negative log-likelihood (NLL) minimization acts as a KL-based compression objective while simultaneously shaping the latent representation space via a pullback Fisher–Rao metric, aligning representations with the predictive structure of the data through a spectral/CCA-like mechanism. The work formalizes how speculative decoding leverages information surplus in hidden states and provides a rigorous connection between NLL and implicit spectral learning, Dirichlet-energy regularization on a predictive-similarity graph, and the emergence of structured geometric directions in $\mathcal{H}$. Overall, it offers a mathematically coherent lens that integrates category theory, information geometry, and spectral methods to explain how training shapes internal representations and capabilities of large language models, with practical guidance for analyzing and harnessing these geometric properties.
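
To make the pullback Fisher–Rao metric concrete, here is a minimal NumPy sketch for a linear-softmax head $p(h)=\mathrm{softmax}(Wh)$. In that case the Fisher information in logit space is $\mathrm{diag}(p)-pp^\top$, so the pulled-back metric on $\mathcal{H}$ is $G(h)=W^\top(\mathrm{diag}(p)-pp^\top)W$. The matrix `W`, the toy sizes, and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pullback_fisher(W, h):
    """Fisher-Rao metric pulled back to hidden space through
    p(h) = softmax(W h):  G(h) = W^T (diag(p) - p p^T) W."""
    p = softmax(W @ h)
    return W.T @ (np.diag(p) - np.outer(p, p)) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))      # hypothetical head: 5 tokens, d_model = 3
h = rng.normal(size=3)
G = pullback_fisher(W, h)

# Second-order check: KL(p(h) || p(h + d)) is close to 0.5 * d^T G d
# when the perturbation d is small.
d = 1e-3 * rng.normal(size=3)
p, q = softmax(W @ h), softmax(W @ (h + d))
kl = float(np.sum(p * np.log(p / q)))
quad = float(0.5 * d @ G @ d)
```

Directions `d` with large `d @ G @ d` are those along which the next-token distribution changes fastest, which is one way to read "predictive sensitivity" in this setting.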

Abstract

Autoregressive language models achieve remarkable performance, yet a unified theory that explains their internal mechanisms, how training shapes their representations, and what enables their complex behaviors remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective provides a unified mathematical language to connect three critical aspects of language modeling that are typically studied in isolation: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework provides a precise information-theoretic rationale for the success of multi-token prediction methods like speculative decoding, quantifying the information surplus a model's hidden state contains about tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective compels the model to learn not just the next word, but also the data's intrinsic conditional uncertainty, a process we formalize using categorical entropy. Our central result shows that, under a linear-softmax head with bounded features, minimizing NLL induces spectral alignment: the learned representation space aligns with the eigenspectrum of a predictive similarity operator. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
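
The claim that NLL forces the model to learn conditional uncertainty can be seen from the standard cross-entropy decomposition, sketched below. The notation is an assumption for illustration: $p(\cdot\mid x)$ is the true next-token kernel, $q_\theta(\cdot\mid x)$ the model kernel, and $H(Y\mid X)$ the Shannon conditional entropy of the data. The paper's own statement (Theorem 2.7) and its categorical-entropy formalism may be phrased differently.

```latex
\mathbb{E}_{x}\,\mathbb{E}_{y\sim p(\cdot\mid x)}\bigl[-\log q_\theta(y\mid x)\bigr]
  \;=\;
  \mathbb{E}_{x}\Bigl[D_{\mathrm{KL}}\bigl(p(\cdot\mid x)\,\big\|\,q_\theta(\cdot\mid x)\bigr)\Bigr]
  \;+\; H(Y\mid X)
```

Since $H(Y\mid X)$ does not depend on $\theta$, minimizing NLL is equivalent to minimizing the average KL term, and the minimum is attained only when $q_\theta(\cdot\mid x)$ reproduces the full conditional distribution, not just its most likely token.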

Paper Structure

This paper contains 42 sections, 22 theorems, 102 equations, 1 figure, 1 table.

Key Result

Lemma 2.4

The spaces $\mathbb{V}^*$ (countable disjoint union of finite products), $\mathcal{H}_{\mathrm{seq\_emb}}$ (countable disjoint union of Euclidean products), $\mathcal{H}\simeq\mathbb R^{d_{\mathrm{model}}}$, and $\mathbb{V}$ (finite) are standard Borel. If $f_{\mathrm{emb}},f_{\mathrm{bb}}$ are Bore

Figures (1)

  • Figure 1: A conceptual overview of our framework. Center: The core thesis models the autoregressive generation step as a composition of Markov kernels $k_{\mathrm{gen}} = k_{\mathrm{head}} \circ k_{\mathrm{bb}} \circ k_{\mathrm{emb}}$ in the category $\texttt{Stoch}$. This separates the deterministic context encoding ($k_{\mathrm{emb}}, k_{\mathrm{bb}}$) from the probabilistic output kernel $k_{\mathrm{head}}$, which is parameterized by a deterministic map $g_{\mathrm{head}}\!:\mathcal{H}\to\Delta$. Top: This compositional lens reveals the deeper mechanisms of the NLL objective, which we re-frame as minimizing the average KL divergence between the model and true data kernels. Under additional constraints satisfied by linear-softmax LM heads (see §\ref{sec:repr_learning_theory}), we show a conditional spectral connection with a predictive-similarity operator; in all cases, NLL compels the model to learn intrinsic conditional stochasticity (via categorical entropy). Bottom: Pulling back the Fisher–Rao metric endows $\mathcal{H}$ with an information geometry that quantifies predictive sensitivity and clarifies the information surplus used by speculative decoding.
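
The composition $k_{\mathrm{gen}} = k_{\mathrm{head}} \circ k_{\mathrm{bb}} \circ k_{\mathrm{emb}}$ can be illustrated on finite spaces, where a Markov kernel is a row-stochastic matrix and composition is matrix multiplication. The sketch below is a toy discretization under that assumption: the paper works with standard Borel spaces such as $\mathcal{H}\simeq\mathbb R^{d_{\mathrm{model}}}$, and the sizes and the two deterministic maps here are hypothetical.

```python
import numpy as np

def deterministic_kernel(f, n_in, n_out):
    """Row-stochastic matrix of the deterministic kernel x -> delta_{f(x)}."""
    K = np.zeros((n_in, n_out))
    K[np.arange(n_in), [f(x) for x in range(n_in)]] = 1.0
    return K

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stabilise each row
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_ctx, n_emb, n_hid, n_vocab = 6, 4, 3, 5  # toy sizes, all hypothetical

# Deterministic stages: one-hot rows.
k_emb = deterministic_kernel(lambda x: x % n_emb, n_ctx, n_emb)
k_bb = deterministic_kernel(lambda e: (2 * e) % n_hid, n_emb, n_hid)
# Stochastic head: each hidden state maps to a distribution over tokens.
k_head = softmax_rows(rng.normal(size=(n_hid, n_vocab)))

# For row-stochastic matrices, "k_head after k_bb after k_emb" is the
# matrix product taken in application order.
k_gen = k_emb @ k_bb @ k_head
```

Because the first two stages are deterministic, each row of `k_gen` is simply the row of `k_head` selected by the encoded context, which is the separation between encoding and stochastic output that the caption describes.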

Theorems & Definitions (37)

  • Definition 2.1: Markov Category [Fritz2020MC]
  • Definition 2.2: Category $\texttt{Stoch}$ [Fritz2020MC, Perrone2022Ent]
  • Remark 2.3: Interpretation
  • Lemma 2.4: Standard Borel and measurability
  • Theorem 2.5: Data Processing Inequality (DPI)
  • Theorem 2.7: NLL Minimization as Average KL Minimization
  • Definition 2.8: Categorical Entropy [Perrone2022Ent]
  • Remark 2.9: Properties and Connections
  • Remark 3.1: Deterministic parameterization vs. stochastic kernel
  • Proposition 4.1: Tail sum and decay
  • ...and 27 more
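
As a quick numerical illustration of the Data Processing Inequality listed above, the sketch below checks its familiar KL form on finite spaces: pushing two distributions through the same Markov kernel cannot increase their KL divergence. This is an assumed, standard formulation; the paper's Theorem 2.5 may be stated in more general terms. All distributions and the kernel are random toy data.

```python
import numpy as np

def kl(p, q):
    """KL divergence between strictly positive finite distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))          # two distributions on a 6-point space
q = rng.dirichlet(np.ones(6))
K = rng.dirichlet(np.ones(4), size=6)  # 6x4 row-stochastic Markov kernel

kl_before = kl(p, q)
kl_after = kl(p @ K, q @ K)            # pushforwards of p and q through K
```

Here `kl_after` never exceeds `kl_before`, whichever kernel `K` is drawn.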