Directed Metric Structures arising in Large Language Models

Stéphane Gaubert; Yiannis Vlassopoulos

TL;DR

The paper shows that conditional word-extension probabilities from large language models induce a directed metric on texts, turning negative log-probabilities into a rich geometric structure. It constructs directed metric polyhedra $P(\mathcal{L})$ and $Q(\mathcal{L})$ with Yoneda and co-Yoneda embeddings placing texts on extremal rays, and demonstrates that $P(\mathcal{L})$ is the $(\min,+)$-span of these rays while its dual $\widehat{P}(\mathcal{L})$ is the $(\min,+)$-span of the co-embeddings, linking semantics to geometry. A duality web connects text extensions and restrictions via extended semirings, Isbell completion, and lattice completions, with extremal rays corresponding to connected lower sets and offering a canonical decomposition of semantic space into principal and non-principal components. The framework provides a tropical-algebraic view of text representations, showing how short (one-word) extensions can be aggregated into longer text vectors through Boltzmann-weighted sums and how adding data preserves isometric embeddings, echoing attention mechanisms in transformers. Overall, the work delivers a mathematically explicit, category-informed picture of how language structure and meaning emerge from probabilistic extensions, with potential implications for representation learning and interpretability in LLMs.

Abstract

Large Language Models are transformer neural networks trained to produce a probability distribution over the possible next words of given texts in a corpus, in such a way that the most likely predicted word is the actual word in the training text. In this paper we identify the mathematical structure defined by such conditional probability distributions of text extensions. Changing the viewpoint from probabilities to $-\log$ probabilities, we observe that the subtext order is completely encoded in a metric structure defined on the space of texts $\mathcal{L}$ by $-\log$ probabilities. We then construct a metric polyhedron $P(\mathcal{L})$ and an isometric embedding (called the Yoneda embedding) of $\mathcal{L}$ into $P(\mathcal{L})$ such that texts map to generators of certain special extremal rays. We explain that $P(\mathcal{L})$ is a $(\min,+)$ (tropical) linear span of these extremal ray generators. The generators also satisfy a system of $(\min,+)$ linear equations. We then show that $P(\mathcal{L})$ is compatible with adding more text, and from this we derive an approximation of a text vector as a Boltzmann-weighted linear combination of the vectors for the words in that text. We then prove a duality theorem showing that text extensions and text restrictions give isometric polyhedra (even though they look a priori very different). Moreover, we prove that $P(\mathcal{L})$ is the lattice closure of (a version of) the so-called Isbell completion of $\mathcal{L}$, which turns out to be the $(\max,+)$ span of the text extremal ray generators. All constructions have interpretations in category theory, but we do not use category theory explicitly. The categorical interpretations are briefly explained in an appendix. In the final appendix we describe how the syntax-to-semantics problem could fit into a general, well-known mathematical duality.
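
As a hedged illustration of the passage from probabilities to $-\log$ probabilities, the directed distance can be sketched as follows. The notation is an assumption made here, not taken from the abstract: $a \leqslant b$ is read as "text $b$ extends text $a$", and $\Pr(b \mid a)$ is the model's probability of that extension.

```latex
d(a,b) =
\begin{cases}
-\log \Pr(b \mid a) & \text{if } a \leqslant b,\\
+\infty & \text{otherwise,}
\end{cases}
\qquad
\Pr(c \mid a) = \Pr(b \mid a)\,\Pr(c \mid b)
\;\Longrightarrow\;
d(a,c) = d(a,b) + d(b,c)
\quad \text{for } a \leqslant b \leqslant c .
```

Under this reading, the chain rule for conditional probabilities becomes additivity of distances along a chain of extensions, and an infinite distance records that one text is not an extension of the other, which is one way to see how the order can be recovered from $d$.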

Paper Structure (17 sections, 44 theorems, 172 equations, 3 figures)

This paper contains 17 sections, 44 theorems, 172 equations, 3 figures.

Overview
Acknowledgements
From probabilities of text extensions to distances
From the text metric space $\mathcal{L}$ to the polyhedra $P(\mathcal{L})$ and $Q(\mathcal{L})$
Texts define special Extremal rays of $P(\mathcal{L})$ and $Q(\mathcal{L})$
All Extremal rays correspond to connected lower sets of $\mathcal{L}$
The polyhedron $P(\mathcal{L})$ as a $(\min,+)$ linear space
$P(\mathcal{L})$ and $\widehat{P}(\mathcal{L})$ as Semantic spaces
From one word text extensions to longer extensions
Compatibility of $P(\mathcal{L})$ with adding more texts
Approximation of a text vector in terms of word vectors
Duality between text extensions and restrictions
Extremal Rays in terms of text vectors
$P^-(\mathcal{L})$ as the lattice completion of the Isbell completion
Some comments about Probabilistic Language Models
...and 2 more sections

Key Result

Proposition 1

The map $d$ satisfies the triangle inequality: $d(a_i,a_k)\leqslant d(a_i,a_j)+d(a_j,a_k)$, and equality holds if and only if $a_i\leqslant a_j\leqslant a_k$ or $a_i\not\leqslant a_k$.
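
The statement can be checked numerically on a small example. The following is a minimal sketch, not the authors' code: it assumes texts are tuples of words, that $a \leqslant b$ means $a$ is a prefix of $b$, and that $d(a,b) = -\log \Pr(b \mid a)$ when $a \leqslant b$ and $+\infty$ otherwise. The table `NEXT`, the list `TEXTS`, and all probabilities are hypothetical values chosen for illustration.

```python
import math
from itertools import product

# Hypothetical next-word model: NEXT[prefix][word] = Pr(word | prefix).
# The numbers are made up for illustration only.
NEXT = {
    (): {"red": 0.6, "blue": 0.4},
    ("red",): {"car": 0.5, "rose": 0.5},
    ("blue",): {"car": 1.0},
}

# A small, prefix-closed set of texts (tuples of words).
TEXTS = [(), ("red",), ("blue",), ("red", "car"), ("red", "rose"), ("blue", "car")]


def is_prefix(a, b):
    """Assumed order on texts: a <= b when b extends a."""
    return b[:len(a)] == a


def d(a, b):
    """Directed distance: -log Pr(b | a) if b extends a, else +infinity."""
    if not is_prefix(a, b):
        return math.inf
    total = 0.0
    # Chain rule: sum the -log probabilities of each added word.
    for i in range(len(a), len(b)):
        total += -math.log(NEXT[b[:i]][b[i]])
    return total


def check_triangle(texts, tol=1e-12):
    """Check the inequality and the stated equality condition on all triples."""
    for a, b, c in product(texts, repeat=3):
        lhs, rhs = d(a, c), d(a, b) + d(b, c)
        if lhs > rhs + tol:
            return False
        chain = is_prefix(a, b) and is_prefix(b, c)
        expect_equal = chain or not is_prefix(a, c)
        if math.isinf(lhs) or math.isinf(rhs):
            is_equal = lhs == rhs
        else:
            is_equal = abs(lhs - rhs) <= tol
        if is_equal != expect_equal:
            return False
    return True


if __name__ == "__main__":
    print(check_triangle(TEXTS))
```

The distance is asymmetric by construction: `d(("red",), ("red", "car"))` is finite, while the reverse direction is infinite because a text is not an extension of its own proper extension.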

Figures (3)

Figure 1: The cross section $\widehat{Q}_0(\mathcal{L})$ of the polyhedral cone $\widehat{Q}(\mathcal{L})$ arising from the metric $d$ (left). Each of the vectors $d(r,-)$, $d(c,-)$, $d(rc,-)$ determines an extreme point of the cross section, denoted by $r$, $c$, or $rc$. There is a fourth extreme point (shown in gray) corresponding to a non-principal upper set. The cross section $Q_0(\mathcal{L})$ (right). There are three extreme points, which correspond to the vectors $d(-,r)$, $d(-,c)$, $d(-,rc)$.
Figure 2: Illustration of the duality between the column and row spaces of metric matrices (\ref{prop-antiisom} and \ref{th-4}). On the right, $\operatorname{Im}(d^M_{\min})$; on the left, $\operatorname{Im}((d^M_{\min})^t)$.
Figure 3: Tropical module generated by the discrete metric $d_2$ of \ref{e-def-d2}. The pseudo-vertices (vertices of the polyhedral complex that do not arise from tropical generators) are shown in gray. (left) The (max,+)-span (right).

Theorems & Definitions (131)

Remark 1
Definition 1
Definition 2
Remark 2
Definition 3
Proposition 1
proof
Corollary 1
Remark 3
Remark 4
...and 121 more
