Lines of Thought in Large Language Models

Raphaël Sarfati, Toni J. B. Liu, Nicolas Boullé, Christopher J. Earls

TL;DR

This paper treats large language models as dynamical systems, studying how embedded prompts traverse latent space through transformer layers as lines of thought (LoT). It shows that independent LoT ensembles cluster along a non-Euclidean, low-dimensional manifold and can be described by a stochastic model with a small number of parameters, extending to continuous time via Langevin dynamics and a Fokker-Planck formulation. The authors propose a discrete-time linear update with rotation and stretch, plus a Gaussian residual, and generalize it to continuous time; they validate the approach with GPT-2 and other models (Llama 2, Mistral, Llama 3.2), revealing both robust transport patterns and last-layer anomalies under reinitialization or fine-tuning. These results offer a compact, probabilistic description of high-dimensional transformer computations, with potential implications for interpretability, model diagnostics, and future hybrid architectures that separate deterministic transport from stochastic, meaning-bearing variability.
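The discrete-time update mentioned above (rotation and stretch, plus a Gaussian residual) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the bases `U_t` and `U_next`, the `stretch` factors, and `noise_std` are synthetic stand-ins, not quantities extracted from any model, and the function `linear_step` is a hypothetical name rather than the authors' implementation.

```python
import numpy as np

def linear_step(x, U_t, U_next, stretch, noise_std, rng):
    """One hypothetical layer-to-layer update (illustrative only).

    x         : (dim,) position at layer t
    U_t       : (dim, dim) orthonormal basis assumed for layer t
    U_next    : (dim, dim) orthonormal basis assumed for layer t+1
    stretch   : (dim,) per-direction stretch factors
    noise_std : standard deviation of the isotropic Gaussian residual
    """
    coords = U_t.T @ x                         # coordinates of x in the layer-t basis
    transported = U_next @ (stretch * coords)  # stretch, then express in the next basis
    residual = rng.normal(0.0, noise_std, size=x.shape)  # Gaussian residual
    return transported + residual

rng = np.random.default_rng(0)
dim = 8

# Random orthonormal bases standing in for per-layer rotations (synthetic data).
U_t, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
U_next, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
stretch = np.linspace(1.5, 1.0, dim)  # assumed stretch factors

x = rng.normal(size=dim)
x_next = linear_step(x, U_t, U_next, stretch, noise_std=0.1, rng=rng)
x_det = linear_step(x, U_t, U_next, stretch, noise_std=0.0, rng=rng)  # noise-free transport
```

With `noise_std=0.0` the step reduces to pure deterministic transport, which makes it easy to check that the stretch acts along the assumed basis directions; the stochastic part is then whatever the residual adds on top.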

Abstract

Large Language Models achieve next-token prediction by transporting a vectorized piece of text (prompt) across an accompanying embedding space under the action of successive transformer layers. The resulting high-dimensional trajectories realize different contextualization, or 'thinking', steps, and fully determine the output probability distribution. We aim to characterize the statistical properties of ensembles of these 'lines of thought.' We observe that independent trajectories cluster along a low-dimensional, non-Euclidean manifold, and that their path can be well approximated by a stochastic equation with few parameters extracted from data. We find it remarkable that the vast complexity of such large models can be reduced to a much simpler form, and we reflect on implications.

Paper Structure

This paper contains 38 sections, 19 equations, 15 figures, 1 table, 1 algorithm.

Figures (15)

  • Figure 1: (a) Lines of thought (blue to red) for an ensemble of 1000 pseudo-sentences of 50 tokens each, projected along the first 3 singular vectors after the last layer ($t=24$). They appear to form a tight bundle, with limited variability around a common average path. (b) Representation of the low-dimensional, ribbon-shaped manifold in $\mathcal{S}$ (projected along 3 Cartesian coordinates). Positions are plotted for $t=12$ (green) to $t=24$ (yellow).
  • Figure 2: (a) Angle between the first 4 singular vectors at $(t_1,t_2)$, $\arccos ( {\bm{u}}_i^{(t_1)}\cdot {\bm{u}}_i^{(t_2)})$, for $i= \{ 1,2,3,4\}$ (top-left, top-right, bottom-left, bottom-right, respectively). (b) Singular values for $t = 1, \dots, 24$ (blue to red). Clusters stretch more and more after each layer. The leading singular values, $\sigma_1(t)$, have been omitted for clarity. (c) Average (over all trajectories) KL divergence between the output distributions of reduced-dimensionality trajectories and the true output distributions, as the dimensionality $K$ is increased. The red dashed line shows the average KL divergence for output distributions from unrelated inputs.
  • Figure 3: Extrapolated token positions $\Tilde{{\bm{x}}}^{(k)}$ (blue) from $t = \{ 12,14,16,18\}$ to $t+\tau = \{ t+1,\dots,21\}$, compared to their true positions ${\bm{x}}^{(k)}$ (gray), projected in the $( {\bm{u}}_2^{(t)},{\bm{u}}_3^{(t)})$ planes.
  • Figure 4: Statistics of $\delta {\bm{x}}(t,\tau)$: mean $\mu$, variance $\sigma^2$, excess kurtosis $\kappa$. Brackets $\langle \dots \rangle$ denote the average over directions ${\bm{e}}_i$ (see the noise schematic figure, labeled fig:noise-schematic, for details). (a) For all $(t,t+\tau)$, $\mu \simeq 0$ (that is, $\mu / \sigma \ll 1$). (b) $\log(\sigma^2)$ increases linearly in time and depends only on $t+\tau$. (c) The excess kurtosis (kurtosis minus 3) remains close to 0, indicating Gaussianity (except in early layers).
  • Figure 5: Simulated distributions for $t=12$, $t+\tau = \{ 12, 13, 14, 15, 16\}$, projected on the $\left( {\bm{u}}_1,{\bm{u}}_2\right)$ plane (top row) and the $\left( {\bm{u}}_3,{\bm{u}}_4\right)$ plane (bottom row). Distributions have been approximated from ensemble trajectories, 10 trajectories for each initial point. Background lines indicate true distributions, thin lines on top indicate simulations.
  • ...and 10 more figures
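
Two of the quantities named in the captions above, the angle between matched singular vectors (Figure 2a) and the mean, variance, and excess kurtosis of a residual (Figure 4), can be sketched in NumPy as follows. This is a minimal sketch on synthetic data: the function names, array shapes, and the `(dim, n_samples)` layout are assumptions for illustration, not the authors' code.

```python
import numpy as np

def leading_singular_vectors(X, k):
    """Return the first k left singular vectors and singular values of X.

    X is assumed to be a (dim, n_samples) matrix holding one ensemble
    of positions at a single layer.
    """
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], S[:k]

def singular_vector_angles(U_a, U_b):
    """Angle arccos(u_i^(a) . u_i^(b)) for each matched pair of columns.

    Note: singular vectors are only defined up to sign, so a sign flip
    between layers shows up as an angle near pi.
    """
    dots = np.sum(U_a * U_b, axis=0)
    return np.arccos(np.clip(dots, -1.0, 1.0))  # clip guards against rounding

def moment_summary(samples):
    """Mean, variance, and excess kurtosis (kurtosis minus 3) of 1-D samples."""
    mu = samples.mean()
    var = samples.var()
    excess_kurtosis = np.mean((samples - mu) ** 4) / var**2 - 3.0
    return mu, var, excess_kurtosis

rng = np.random.default_rng(1)

# Synthetic ensemble: 16 dimensions, 500 samples, with unequal spread per direction.
scales = np.linspace(4.0, 1.0, 16)[:, None]
X = scales * rng.normal(size=(16, 500))
U, S = leading_singular_vectors(X, k=4)
self_angles = singular_vector_angles(U, U)  # a basis compared with itself

# Synthetic Gaussian residuals along one direction.
residuals = rng.normal(0.0, 1.0, size=200_000)
mu, var, kappa = moment_summary(residuals)
```

Comparing a basis with itself gives angles of essentially zero, and Gaussian samples give an excess kurtosis close to zero, which is the sense in which the Figure 4 caption reads a near-zero $\kappa$ as evidence of Gaussianity.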