The Origins of Representation Manifolds in Large Language Models

Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley

TL;DR

The paper introduces a formal framework for representation manifolds in large language models by extending the linear representation hypothesis to a multidimensional, manifold-valued view of features. It posits a continuous correspondence between a feature's metric space $\mathcal{Z}_{\mathtt{f}}$ and a representation manifold $\mathcal{M}_{\mathtt{f}}$ via a map $\phi_{\mathtt{f}}$, assumes that cosine similarity is locally a function of squared feature distance, $\text{CosSim}(\phi_{\mathtt{f}}(z),\phi_{\mathtt{f}}(z')) = g_{\mathtt{f}}(\mathsf{d}_{\mathtt{f}}(z,z')^{2})$, and proves a key theorem linking path lengths in the feature space to those on the representation manifold. Through empirical analyses on text embeddings and token activations (e.g., colours, years, dates), the work shows a homeomorphism between feature spaces and their representation manifolds and, in several cases, near-isometry after appropriate transformations, implying that intrinsic feature geometry can be recovered from representation geometry. The authors discuss limitations, such as manual metric specification and manifold estimation challenges, and propose future directions toward manifold-aware sparse autoencoders and learning the map $\phi_{\mathtt{f}}$ to enable mechanistic interventions in model behavior.

Abstract

There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of 'almost-orthogonal' direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
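
As a minimal illustration of how cosine similarity can encode intrinsic geometry, the sketch below uses a toy feature that lives on a circle (loosely analogous to dates of the year) with arc length as its metric. The map `embed_circle`, the embedding dimension, and the random seed are hypothetical choices made for this example and do not come from the paper. In this toy case, cosine similarity is an exact function of squared feature distance, with $g(t) = \cos(\sqrt{t})$, and path length along the embedded curve matches path length in the feature space up to discretisation error.

```python
import numpy as np

def embed_circle(theta, dim=16, seed=0):
    """Hypothetical map phi: angle -> unit vector on a great circle in R^dim."""
    rng = np.random.default_rng(seed)
    # Random orthonormal 2-frame (columns of q) spanning the plane of the circle.
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    return np.cos(theta)[:, None] * q[:, 0] + np.sin(theta)[:, None] * q[:, 1]

def arc_distance(a, b):
    """Shortest angular distance on the circle (the assumed feature metric)."""
    diff = np.abs(a - b) % (2 * np.pi)
    return np.minimum(diff, 2 * np.pi - diff)

# 400 evenly spaced feature values on the circle and their representations.
theta = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
x = embed_circle(theta)

# Rows of x are unit vectors, so dot products are cosine similarities.
cos_sim = x @ x.T
d = arc_distance(theta[:, None], theta[None, :])

# Check CosSim(phi(z), phi(z')) = g(d(z, z')^2) with g(t) = cos(sqrt(t)).
max_gap = np.max(np.abs(cos_sim - np.cos(np.sqrt(d ** 2))))

# Compare path lengths for the quarter circle covered by the first 101 points:
# sum of chord lengths along the embedded curve vs. arc length in feature space.
path = x[:101]
manifold_length = np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1))
feature_length = theta[100] - theta[0]

print(f"max |CosSim - g(d^2)| = {max_gap:.2e}")
print(f"manifold length = {manifold_length:.5f}, feature length = {feature_length:.5f}")
```

Real representation manifolds need not be this regular; the point is only that a similarity which depends on squared distance alone already pins down lengths measured along the curve.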

Paper Structure

This paper contains 14 sections, 3 theorems, 19 equations, 4 figures.

Key Result

Proposition 1

Under Hypothesis hyp:correspondence, the map $\phi_{\mathtt{f}}: \mathcal{Z}_{\mathtt{f}} \rightarrow \mathcal{M}_\mathtt{f}$ is a homeomorphism. This is simply a restatement of the well-established fact that a continuous invertible map over a compact domain has a continuous inverse sutherland_introd
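
The standard argument behind this fact can be sketched as follows. The assumptions listed in the comments are the usual hypotheses of the textbook result and are assumed here rather than quoted from the paper: $\mathcal{Z}_{\mathtt{f}}$ is compact, $\phi_{\mathtt{f}}$ is a continuous bijection onto $\mathcal{M}_{\mathtt{f}}$, and $\mathcal{M}_{\mathtt{f}}$ is Hausdorff (as any subset of Euclidean space is).

```latex
% Assumed: Z_f compact, phi_f a continuous bijection onto M_f, M_f Hausdorff.
% Requires amsmath for align*, \text and \implies.
\begin{align*}
C \subseteq \mathcal{Z}_{\mathtt{f}} \text{ closed}
  &\implies C \text{ compact}
  && \text{(closed subset of a compact space)} \\
  &\implies \phi_{\mathtt{f}}(C) \text{ compact}
  && \text{(continuous image of a compact set)} \\
  &\implies \phi_{\mathtt{f}}(C) \text{ closed in } \mathcal{M}_{\mathtt{f}}
  && \text{(compact subset of a Hausdorff space)}.
\end{align*}
% Since (\phi_f^{-1})^{-1}(C) = \phi_f(C), the preimage of every closed set
% under \phi_f^{-1} is closed, so \phi_f^{-1} is continuous.
```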

Figures (4)

  • Figure 1: Representation manifolds in large language models: colours, years and dates. The first and third examples show text embeddings obtained from OpenAI's text-embedding-large-3 model from prompts relating to English names for colours and dates of the year, respectively. The second example shows token activations from layer 7 of GPT2-small, which were studied in engels_not_2025. The token activations were processed via an SAE to extract a feature corresponding to years of the twentieth century as in engels_not_2025, and normalized to have norm one. For each example, we perform principal component analysis (PCA) to reduce the dimension to three and display the resulting point clouds from two perspectives. The embeddings of English names for colours are displayed in their respective colour value. Years are coloured from blue (1900) through green to yellow (1999), and dates are coloured from white (1st January) through blue to black (1st July) through red and back to white.
  • Figure 2: Representation manifolds in token activations from layer 8 of Mistral 7B, processed via an SAE to extract representations of 'months of the year' and 'days of the week', as in engels_not_2025. We normalise the representations to have norm one, and perform PCA into three dimensions. The top-down view of the first two principal components, which was shown in engels_not_2025, obscures manifold structure which weaves through the third principal component.
  • Figure 3: Evidence for Hypothesis hyp:cossim and its implications in Theorem 1. For each pair of representations, we plot their cosine similarities (first row) and estimated manifold distances (second row) against their (squared) distance in a putative metric space. We report the Chatterjee ($\xi$) and Pearson ($\rho$) correlation coefficients, respectively. Colours correspond to the colourmaps described in Figure 1.
  • Figure 4: Evidence against isometry with respect to the metric space $\mathcal{Z}_{\texttt{years}} = [1900,1999]$, $\mathsf{d}_{\texttt{years}}(x,y) = |x-y|$. There is no clear regular linear relationship between distances in this metric space and estimated distances on the representation manifold. The colours indicate that distances between more recent years are expanded on the manifold.
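
The kind of comparison described for Figures 3 and 4 can be sketched on synthetic data. This excerpt does not say how the authors estimate manifold distances, so the sketch substitutes an Isomap-style estimate (shortest paths through a k-nearest-neighbour graph). The toy curve, the function names, and the parameters `n`, `k` and the seed are all assumptions made for illustration. Chatterjee's $\xi$ is computed with the no-ties formula.

```python
import numpy as np

def knn_geodesic_distances(x, k=5):
    """Isomap-style estimate: shortest paths through a k-nearest-neighbour graph."""
    n = x.shape[0]
    eucl = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    # k nearest neighbours of each point (column 0 of the argsort is the point itself).
    nbrs = np.argsort(eucl, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    graph[rows, nbrs.ravel()] = eucl[rows, nbrs.ravel()]
    graph = np.minimum(graph, graph.T)  # make the graph undirected
    # Floyd-Warshall all-pairs shortest paths, vectorised over (i, j).
    for m in range(n):
        graph = np.minimum(graph, graph[:, m, None] + graph[None, m, :])
    return graph

def chatterjee_xi(a, b):
    """Chatterjee's rank correlation of b on a, assuming no ties."""
    order = np.argsort(a)
    ranks = np.argsort(np.argsort(b[order])) + 1
    n = len(a)
    return 1.0 - 3.0 * np.sum(np.abs(np.diff(ranks))) / (n ** 2 - 1)

# Toy data (hypothetical): feature values z in [0, 1) on a jittered grid,
# embedded as an arc of angle 2 radians on a great circle in R^8.
rng = np.random.default_rng(0)
n = 120
z = (np.arange(n) + 0.5 * rng.uniform(size=n)) / n
q, _ = np.linalg.qr(rng.standard_normal((8, 2)))
x = np.cos(2 * z)[:, None] * q[:, 0] + np.sin(2 * z)[:, None] * q[:, 1]

iu = np.triu_indices(n, k=1)                    # each unordered pair once
feat_dist = np.abs(z[:, None] - z[None, :])[iu]  # putative metric d(z, z')
cos_sim = (x @ x.T)[iu]                          # rows of x have unit norm
geo_dist = knn_geodesic_distances(x, k=5)[iu]

xi = chatterjee_xi(feat_dist ** 2, cos_sim)      # cosine similarity vs. squared distance
rho = np.corrcoef(feat_dist, geo_dist)[0, 1]     # manifold distance vs. distance

print(f"Chatterjee xi = {xi:.3f}, Pearson rho = {rho:.3f}")
```

On this idealised curve both coefficients come out close to one. A weaker Pearson correlation alongside a strong $\xi$ would be the pattern to expect when the putative metric is related to the manifold geometry only through a nonlinear transformation.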

Theorems & Definitions (7)

  • Definition 1: Multidimensional linear representation hypothesis
  • Definition 2
  • Proposition 1
  • Theorem 1
  • Proof of Theorem 1
  • Lemma 1
  • Proof of Lemma 1