The magnitude of categories of texts enriched by language models
Tai-Danae Bradley, Juan Pablo Vigneaux
TL;DR
The paper links autoregressive language models to enriched category theory by encoding next-token probabilities as hom-objects, which yields $[0,1]$- and $[0,\infty]$-enriched categories of texts. It derives an explicit magnitude formula for the generalized metric space induced by the language model, $\mathrm{Mag}(t\mathcal{M})=(t-1)\sum_{x\in\mathrm{ob}(\mathcal{M})\setminus T(\bot)}H_t(p_x)+\#(T(\bot))$, where $H_t(p_x)$ is the $t$-logarithmic (Tsallis) entropy of the next-token distribution at the prompt $x$ and $\#(T(\bot))$ is the cardinality of the model's possible outputs. The derivative of this function at $t=1$ recovers a sum of Shannon entropies, which connects magnitude to a partition-function-like quantity. The work further expresses magnitude via magnitude homology, identifying $H_{0,0}$ and $H_{1,\ell}$ as fundamental components and conjecturing that the higher homology groups vanish, thereby linking information-theoretic entropy with topological invariants. Overall, the framework offers a geometric and topological lens on language-model semantics and their uncertainty structure, with potential implications for assessing model diversity and information content.
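The link between the magnitude formula and Shannon entropy can be checked in one line. The sketch below assumes the standard form of the Tsallis entropy, $H_t(p)=\frac{1}{t-1}\bigl(1-\sum_i p_i^{\,t}\bigr)$, and next-token distributions with finite support (sums run over indices with $p_i>0$); both are assumptions made here for illustration, not statements taken from the paper.

$$(t-1)\,H_t(p)=1-\sum_i p_i^{\,t},\qquad \frac{d}{dt}\Bigl[(t-1)\,H_t(p)\Bigr]_{t=1}=-\sum_i p_i\ln p_i=H(p).$$

Summing over the prompts $x\in\mathrm{ob}(\mathcal{M})\setminus T(\bot)$ gives

$$\mathrm{Mag}(t\mathcal{M})\big|_{t=1}=\#(T(\bot)),\qquad \frac{d}{dt}\,\mathrm{Mag}(t\mathcal{M})\Big|_{t=1}=\sum_{x\in\mathrm{ob}(\mathcal{M})\setminus T(\bot)}H(p_x),$$

so under these assumptions the value at $t=1$ counts the possible outputs, while the derivative at $t=1$ is the sum of Shannon entropies.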
Abstract
The purpose of this article is twofold. Firstly, we use the next-token probabilities given by a language model to explicitly define a category of texts in natural language enriched over the unit interval, in the sense of Bradley, Terilla, and Vlassopoulos. We consider explicitly the terminating conditions for text generation and determine when the enrichment itself can be interpreted as a probability over texts. Secondly, we compute the Möbius function and the magnitude of an associated generalized metric space of texts. The magnitude function of that space is a sum over texts (prompts) of the $t$-logarithmic (Tsallis) entropies of the next-token probability distributions associated with each prompt, plus the cardinality of the model's possible outputs. A suitable evaluation of the magnitude function's derivative recovers a sum of Shannon entropies, which justifies seeing magnitude as a partition function. Following Leinster and Shulman, we also express the magnitude function of the generalized metric space as an Euler characteristic of magnitude homology and provide an explicit description of the zeroth and first magnitude homology groups.
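To make the magnitude formula concrete, here is a minimal numerical sketch in Python. It evaluates the closed-form expression quoted in the TL;DR on a hand-made toy model and checks numerically that the derivative at $t=1$ matches the sum of Shannon entropies. The names `tsallis_entropy`, `magnitude`, and `toy_model`, the toy probabilities, and the use of `†` as an end-of-text marker are all hypothetical choices for illustration; the sketch also assumes the standard Tsallis form $H_t(p)=(1-\sum_i p_i^t)/(t-1)$ and does not verify that the toy tree satisfies the paper's terminating conditions.

```python
import math


def tsallis_entropy(probs, t):
    """t-logarithmic (Tsallis) entropy; reduces to Shannon entropy (nats) at t = 1."""
    probs = [q for q in probs if q > 0]
    if abs(t - 1.0) < 1e-12:
        return -sum(q * math.log(q) for q in probs)
    return (1.0 - sum(q ** t for q in probs)) / (t - 1.0)


def magnitude(next_token, t):
    """Evaluate (t - 1) * sum_x H_t(p_x) + #(terminal texts).

    next_token maps each non-terminal text (prompt) to a dict
    {continuation text: probability}. Texts that appear only as
    continuations are treated as terminal outputs (an assumption
    of this sketch).
    """
    terminals = {
        child
        for dist in next_token.values()
        for child in dist
        if child not in next_token
    }
    entropy_sum = sum(
        tsallis_entropy(list(dist.values()), t) for dist in next_token.values()
    )
    return (t - 1.0) * entropy_sum + len(terminals)


# Hypothetical toy model: "" is the empty prompt, "†" marks end of text.
toy_model = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a†": 1.0},
    "b": {"ba": 0.25, "b†": 0.75},
    "ba": {"ba†": 1.0},
}

# Sum of Shannon entropies over all prompts.
shannon_sum = sum(
    tsallis_entropy(list(dist.values()), 1.0) for dist in toy_model.values()
)

# Central finite difference approximating the derivative of the magnitude at t = 1.
h = 1e-4
derivative_at_1 = (magnitude(toy_model, 1.0 + h) - magnitude(toy_model, 1.0 - h)) / (2 * h)

print(magnitude(toy_model, 1.0), derivative_at_1, shannon_sum)
```

In this toy model the value at $t=1$ equals the number of terminal texts, and the finite-difference derivative agrees with the Shannon sum up to discretization error, which is the behavior the abstract describes.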
