The Geometry of Tokens in Internal Representations of Large Language Models

Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti

TL;DR

Using a mean-field interpretation, the paper treats the token embeddings at each transformer layer as samples from an empirical measure $\mu=\frac{1}{n}\sum_{j=1}^n \delta_{x_j}$ and probes their geometry with three metrics: intrinsic dimension (ID) $\hat{d}$, neighborhood overlap $\chi_k^{\ell,m}$, and cosine similarity, with the aim of linking geometry to next-token loss. Across Llama, Mistral, and Pythia on Pile-10K prompts, ID peaks appear in early-to-middle layers and grow with token shuffling; for shuffled data, cosine similarity increases and neighborhood overlap declines near the ID peak. The paper reports a correlation between token-level ID and average cross-entropy loss that holds across models, and frames it theoretically by tying ID to logits and contextual entropy through the softmax mechanism. The authors propose intrinsic dimension as a diagnostic tool for model behavior and training dynamics, enabling unsupervised interpretation of prompt processing in large language models.
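To make the token-level intrinsic dimension concrete, the sketch below estimates $\hat{d}$ for a single point cloud, such as the hidden states of one prompt at one layer. This summary does not name the estimator the paper uses, so the TwoNN maximum-likelihood estimator is an assumption chosen for brevity, and the function name `twonn_intrinsic_dimension` and the synthetic test cloud are hypothetical.

```python
import numpy as np


def twonn_intrinsic_dimension(points):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    points: array of shape (n_points, n_features), e.g. the token
    representations of one prompt at one layer (assumed layout).
    """
    x = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)          # clip tiny negative round-off
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbor
    # Two smallest distances per row: first and second nearest neighbor.
    two_smallest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = two_smallest[:, 0], two_smallest[:, 1]
    # Drop exact duplicates (r1 == 0), e.g. repeated tokens at the input layer.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]
    # Under the TwoNN model the ratios mu follow a Pareto law with exponent d,
    # whose maximum-likelihood estimate is N / sum(log mu).
    return keep.sum() / np.sum(np.log(mu))


# Synthetic check: a 2-D sheet embedded isometrically in 20 dimensions.
rng = np.random.default_rng(0)
flat = rng.uniform(size=(1000, 2))
basis, _ = np.linalg.qr(rng.normal(size=(20, 2)))  # orthonormal 20x2 basis
cloud = flat @ basis.T
estimate = twonn_intrinsic_dimension(cloud)
print(round(float(estimate), 2))
```

On the synthetic sheet the estimate should land near 2 even though the ambient dimension is 20. This is the sense in which ID differs from embedding width.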

Abstract

We investigate the relationship between the geometry of token embeddings and their role in the next token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics to a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.
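The other two metrics named in the abstract can be sketched in a few lines. The definitions used here are assumptions, since this summary gives no formulas. Average cosine similarity is taken as the mean over all distinct token pairs. Neighborhood overlap is taken as the mean fraction of each token's $k$ nearest neighbors that it shares between two layers. The helper names and the random stand-in arrays are hypothetical.

```python
import numpy as np


def average_cosine_similarity(x):
    """Mean cosine similarity over all distinct pairs of rows in x."""
    x = np.asarray(x, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = x.shape[0]
    # Exclude the diagonal (self-similarity) from the average.
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))


def knn_indices(x, k):
    """Indices of the k nearest neighbors (Euclidean) of every row in x."""
    x = np.asarray(x, dtype=float)
    sq_norms = np.sum(x**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def neighborhood_overlap(x_a, x_b, k=2):
    """Mean fraction of k nearest neighbors shared between two clouds.

    x_a and x_b hold the same tokens in the same row order, e.g. the
    representations of one prompt at two different layers (assumed layout).
    """
    nn_a = knn_indices(x_a, k)
    nn_b = knn_indices(x_b, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k


# Random stand-ins for two layers of one 50-token prompt.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(50, 8))
layer_b = rng.normal(size=(50, 8))
print(average_cosine_similarity(layer_a))
print(neighborhood_overlap(layer_a, layer_b, k=2))
```

Comparing a layer with itself gives an overlap of 1, while two unrelated clouds give a value near chance, so the metric measures how much local neighborhood structure survives from one layer to another.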

Paper Structure (21 sections, 13 equations, 21 figures, 1 table, 1 algorithm)


Figures (21)

  • Figure 1: Shuffling algorithm
  • Figure 2: Average Cosine Similarity. Left Panel: average cosine similarity among tokens for a single prompt as a function of model layers. Right Panel: average cosine similarity averaged over $2244$ prompts as a function of layers for the full shuffle ($S=5$) and the structured case ($S=0$). The color bar indicates the shuffle index $S$. The shaded regions indicate the standard deviation from the mean. All curves have been calculated for the Llama model.
  • Figure 3: Intrinsic Dimension. Left Panel: intrinsic dimension for a single random prompt as a function of model layers. Right Panel: intrinsic dimension averaged over $2244$ prompts as a function of layers for the full shuffle ($S=5$) and the structured case ($S=0$). The shaded regions indicate the standard deviation from the mean. The color bar indicates the shuffle index $S$. All curves have been calculated for the Llama model.
  • Figure 4: Angle distribution between nearest neighbors. Left Panel: histogram of the angles between the first and second nearest neighbor at layer $10$ of the Llama model for a single prompt for the full shuffle case and structured case. The dotted vertical lines indicate the average angle between the nearest neighbors in both cases. Right Panel: histogram of the average angle between the first and second nearest neighbor at layer $10$ of the Llama model in the fully shuffled (orange) and structured case (blue). The histograms are computed from $2244$ prompts in each case.
  • Figure 5: Neighborhood Overlap. Left Panel: neighborhood overlap for a single random prompt as a function of model layers for $k_{\rm{NN}} = 2$. The color bar indicates the shuffle index $S$. Right Panel: neighborhood overlap averaged over $2244$ prompts as a function of layers for the full shuffle ($S=5$) and the structured case ($S=0$). The shaded regions indicate the standard deviation from the mean, and the grey region marks the layers around the ID peak where the shuffled prompts have a lower NO than the structured prompts. All curves have been calculated for the Llama model.
  • ...and 16 more figures