A Markov Categorical Framework for Language Modeling
Yifan Zhang
TL;DR
The paper introduces a principled framework that treats autoregressive language-model generation as a composition of Markov kernels within the Markov category $\texttt{Stoch}$, unifying the training objective, learning geometry, and model capabilities. It shows that negative log-likelihood (NLL) minimization acts as a KL-based compression objective while simultaneously shaping the latent representation space via a pullback Fisher–Rao metric, aligning representations with the predictive structure of the data through a spectral/CCA-like mechanism. The work formalizes how speculative decoding leverages information surplus in hidden states, and it rigorously connects NLL minimization to implicit spectral learning, to Dirichlet-energy regularization on a predictive-similarity graph, and to the emergence of structured geometric directions in the representation space $\mathcal{H}$. Overall, it offers a mathematically coherent lens that integrates category theory, information geometry, and spectral methods to explain how training shapes the internal representations and capabilities of large language models, with practical guidance for analyzing and harnessing these geometric properties.
Abstract
Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and how those representations enable complex behaviors remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective provides a unified mathematical language to connect three critical aspects of language modeling that are typically studied in isolation: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework provides a precise information-theoretic rationale for the success of multi-token prediction methods like speculative decoding, quantifying the information surplus a model's hidden state contains about tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective compels the model to learn not just the next word, but also the data's intrinsic conditional uncertainty, a process we formalize using categorical entropy. Our central result shows that, under a linear-softmax head with bounded features, minimizing NLL induces spectral alignment: the learned representation space aligns with the eigenspectrum of a predictive similarity operator. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
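
To make the compositional picture concrete, here is a minimal sketch for the finite case only. It is not the paper's construction, and it makes several assumptions: a Markov kernel between finite sets is represented as a row-stochastic matrix, composition is matrix multiplication, and the names `encoder`, `head`, and `data`, along with all sizes and random values, are hypothetical toy choices. It also checks the standard identity that the expected NLL (cross-entropy) equals the data's conditional entropy plus the KL divergence from the data kernel to the model kernel, which is the sense in which NLL can be read as a KL-based objective.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax; each output row is a strictly positive distribution."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def is_kernel(k, tol=1e-9):
    """A finite Markov kernel X -> Y: nonnegative entries, rows sum to 1."""
    return bool(np.all(k >= -tol) and np.allclose(k.sum(axis=1), 1.0, atol=tol))


def compose(f, g):
    """Apply f (X -> Y) first, then g (Y -> Z); for finite kernels this is f @ g."""
    return f @ g


def entropy(p):
    """Per-row Shannon entropy in nats."""
    return -(p * np.log(p)).sum(axis=1)


def cross_entropy(p, q):
    """Per-row expected NLL of q under p, in nats."""
    return -(p * np.log(q)).sum(axis=1)


def kl(p, q):
    """Per-row KL divergence KL(p || q), in nats."""
    return (p * np.log(p / q)).sum(axis=1)


# Hypothetical toy sizes and random kernels (assumptions for illustration only).
rng = np.random.default_rng(0)
n_ctx, n_hidden, n_vocab = 4, 3, 5

encoder = softmax(rng.normal(size=(n_ctx, n_hidden)))  # context -> hidden state
head = softmax(rng.normal(size=(n_hidden, n_vocab)))   # hidden state -> next token
model = compose(encoder, head)                         # context -> next token

data = softmax(rng.normal(size=(n_ctx, n_vocab)))      # stand-in for the true kernel

nll = cross_entropy(data, model)
# Residual of the identity NLL = H(data) + KL(data || model), per context.
gap = nll - (entropy(data) + kl(data, model))
```

Because the entropy term depends only on the data, lowering NLL in this toy setting can only lower the KL term; the composed `model` remains a valid kernel since products of row-stochastic matrices are row-stochastic.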
