How to Tame Your LLM: Semantic Collapse in Continuous Systems

C. M. Wyss

TL;DR

The paper proposes the Semantic Characterization Theorem (SCT), which shows that LLMs operating in a continuous latent space exhibit discrete symbolic semantics via spectral lumpability and o-minimal definability. By modeling LLMs as Continuous State Machines with a transfer operator $P$, it proves that a finite set of dominant eigenfunctions induces semantic basins that align with definable, low-complexity cells. The argument has two prongs, spectral analysis and logical tameness, and shows both that discrete semantics emerge from continuous computation and that the spectral and definable partitions agree up to measure-zero boundaries. Empirically, diffusion-based experiments on sentence embeddings reveal a small set of dominant semantic dimensions, definable basins, and an ontological skeleton, supporting the SCT and suggesting practical avenues for prompting and interpretable AI design.

Abstract

We develop a general theory of semantic dynamics for large language models by formalizing them as Continuous State Machines (CSMs): smooth dynamical systems whose latent manifolds evolve under probabilistic transition operators. The associated transfer operator $P : L^2(\mathcal{M},\mu) \to L^2(\mathcal{M},\mu)$ encodes the propagation of semantic mass. Under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ is compact with discrete spectrum. Within this setting, we prove the Semantic Characterization Theorem (SCT): the leading eigenfunctions of $P$ induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$. Thus spectral lumpability and logical tameness coincide. This explains how discrete symbolic semantics can emerge from continuous computation: the continuous activation manifold collapses into a finite, logically interpretable ontology. We further extend the SCT to stochastic and adiabatic (time-inhomogeneous) settings, showing that slowly drifting kernels preserve compactness, spectral coherence, and basin structure.
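
The claim that $P$ has a discrete spectrum with finitely many dominant modes can be illustrated on a finite sample. The sketch below builds a row-stochastic matrix as a stand-in for $P$ from a Gaussian diffusion kernel over synthetic points, then locates the largest gap among its leading eigenvalues. The Gaussian kernel, the bandwidth, the three-cluster toy data, and the function names are all assumptions made for illustration; none of them is taken from the paper.

```python
import numpy as np

def gaussian_kernel(points, bandwidth):
    """Pairwise Gaussian affinities between sample points (assumed kernel)."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def leading_spectrum(kernel, k):
    """Top-k eigenvalues and right eigenvectors of P = D^{-1} K.

    P is similar to the symmetric matrix D^{-1/2} K D^{-1/2}, so we use the
    symmetric solver and map its eigenvectors back to eigenvectors of P.
    """
    degrees = kernel.sum(axis=1)
    sym = kernel / np.sqrt(np.outer(degrees, degrees))
    eigvals, eigvecs = np.linalg.eigh(sym)           # ascending order
    order = np.argsort(eigvals)[::-1][:k]            # keep the k largest
    modes = eigvecs[:, order] / np.sqrt(degrees)[:, None]
    return eigvals[order], modes

# Hypothetical toy data: three well-separated clusters in the plane.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
points = np.concatenate([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])

K = gaussian_kernel(points, bandwidth=1.0)
P = K / K.sum(axis=1, keepdims=True)                 # row-stochastic stand-in for P
eigvals, modes = leading_spectrum(K, k=5)

# Number of dominant modes = position of the largest gap in the leading spectrum.
gaps = eigvals[:-1] - eigvals[1:]
num_modes = int(np.argmax(gaps)) + 1
print("leading eigenvalues:", np.round(eigvals, 4))
print("dominant modes:", num_modes)
```

On this toy data the largest gap falls after the third eigenvalue, so `num_modes` is 3, matching the three planted clusters. Real sentence embeddings would need a data-dependent bandwidth and are unlikely to show a gap this clean.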

Paper Structure

This paper contains 68 sections, 20 theorems, 62 equations, 1 figure.

Key Result

theorem 3.2

Consider a (possibly stochastic) continuous state machine satisfying Assumptions (A1)–(A5). Let $K : \mathcal{M} \times \mathcal{B}(\mathcal{M}) \to [0,1]$ denote the induced Markov kernel and $P : L^2(\mathcal{M},\mu) \to L^2(\mathcal{M},\mu)$ its associated transfer operator. Then: In particular, the latent semantic space of a CSM, though continuous in its parameters and representa

Figures (1)

  • Figure 1: Empirical illustration of the Semantic Characterization Theorem. (a) Finite dominant modes: spectral gap for $P$. (b) Basins via eigenfunctions: $B_i$ from $\arg\max_j|\varphi_j|$. (c) Ontological skeleton: coarse $G_\theta$ over basins from rollouts. (d) Logical tameness: small models separate basins (proxy for definable boundaries).
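
Panel (c) describes a coarse graph over basins estimated from rollouts. As a minimal sketch of that idea, the code below samples one trajectory from a small Markov chain and counts basin-to-basin transitions. The six-state chain, the basin labels, the trajectory length, and the function name are hypothetical; the excerpt does not specify how the paper constructs $G_\theta$.

```python
import numpy as np

def coarse_graph_from_rollout(P, labels, num_basins, num_steps, rng):
    """Estimate basin-to-basin transition frequencies from one sampled trajectory."""
    counts = np.zeros((num_basins, num_basins))
    state = 0
    for _ in range(num_steps):
        next_state = rng.choice(len(P), p=P[state])   # one step of the chain
        counts[labels[state], labels[next_state]] += 1
        state = next_state
    # Normalize each row so it reads as a transition distribution over basins.
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical chain: 6 states grouped into 3 basins of 2 states each.
labels = np.array([0, 0, 1, 1, 2, 2])
same_basin = labels[:, None] == labels[None, :]
# Probability 0.475 to each state in the same basin, 0.0125 to each other state.
P = np.where(same_basin, 0.475, 0.0125)

G = coarse_graph_from_rollout(P, labels, num_basins=3, num_steps=20000,
                              rng=np.random.default_rng(0))
print(np.round(G, 3))
```

This toy chain is exactly lumpable with respect to the chosen basins: every state stays in its basin with probability 0.95 and moves to each other basin with probability 0.025, so the estimated matrix should approach those values as the rollout lengthens.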

Theorems & Definitions (44)

  • definition 2.1: Continuous State Machine
  • remark 2.2
  • definition 2.3: LLM as Continuous State Machine
  • example 2.4: A Minimal Smooth CSM
  • remark 2.5: Stochastic transition kernel
  • example 2.6: A CSM with Stochastic Decoder
  • definition 2.7: Standard Borel Space
  • remark 3.1: Scope and structure of proofs
  • theorem 3.2: Semantic Characterization Theorem
  • proof : Proof sketch and strategy
  • ...and 34 more