The Origins of Representation Manifolds in Large Language Models
Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley
TL;DR
The paper introduces a formal framework for representation manifolds in large language models, extending the linear representation hypothesis to a multidimensional, manifold-valued view of features. It posits a continuous correspondence between a feature's metric space $\mathcal{Z}_{\mathtt{f}}$ and a representation manifold $\mathcal{M}_{\mathtt{f}}$ via a map $\phi_{\mathtt{f}}$, under which cosine similarity is locally a function of squared feature distance, $\text{CosSim}(\phi_{\mathtt{f}}(z),\phi_{\mathtt{f}}(z')) = g_{\mathtt{f}}(\mathsf{d}_{\mathtt{f}}(z,z')^{2})$. A key theorem links path lengths in the feature space to path lengths on the representation manifold. Empirical analyses of text embeddings and token activations (e.g., colours, years, dates) support a homeomorphism between feature spaces and their representation manifolds and, in several cases, near-isometry after appropriate transformations, implying that intrinsic feature geometry can be recovered from representation geometry. The authors discuss limitations, such as manual metric specification and the difficulty of manifold estimation, and propose future directions: manifold-aware sparse autoencoders and learning the map $\phi_{\mathtt{f}}$ to enable mechanistic interventions in model behavior.
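A heuristic sketch of why a path-length link of this kind is plausible, under assumptions made here for illustration only: representations have unit norm (so cosine similarity equals the inner product), and $g_{\mathtt{f}}$ is differentiable at $0$ with $g_{\mathtt{f}}(0)=1$ and $g_{\mathtt{f}}'(0)<0$. The paper's theorem may be stated under different conditions. For nearby $z, z'$:

$$
\|\phi_{\mathtt{f}}(z)-\phi_{\mathtt{f}}(z')\|^{2}
= 2 - 2\,g_{\mathtt{f}}\big(\mathsf{d}_{\mathtt{f}}(z,z')^{2}\big)
= -2\,g_{\mathtt{f}}'(0)\,\mathsf{d}_{\mathtt{f}}(z,z')^{2} + o\big(\mathsf{d}_{\mathtt{f}}(z,z')^{2}\big).
$$

Summing such short chords along a path $\gamma$ in $\mathcal{Z}_{\mathtt{f}}$ then suggests

$$
\mathrm{len}(\phi_{\mathtt{f}}\circ\gamma) = \sqrt{-2\,g_{\mathtt{f}}'(0)}\;\mathrm{len}(\gamma),
$$

where $\mathrm{len}$ is notation introduced here for path length. Under these assumptions, on-manifold path lengths match feature-space path lengths up to a constant factor.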
Abstract
There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of 'almost-orthogonal' direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
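The claim that on-manifold path lengths can encode feature geometry can be illustrated with a toy example. This is not the authors' experiment, and every name in it (`embed_circle`, `polyline_length`, `weights`) is hypothetical. The sketch assumes a circular feature (an angle) mapped to unit vectors by a few weighted harmonics, so that cosine similarity is $\sum_k a_k^2 \cos(k\Delta)$ for angular gap $\Delta$ and the embedded curve has constant speed $\sqrt{\sum_k a_k^2 k^2}$. It then checks that cosine similarity depends only on the gap, and that dividing the representation-space path length by that speed recovers the feature-space length.

```python
import numpy as np

def embed_circle(theta, weights):
    """Toy map phi: angles on a circle -> unit vectors in R^(2K).

    Hypothetical construction for illustration only: harmonic k gets
    amplitude a_k, with the a_k normalised so every output has norm 1.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    a = np.asarray(weights, dtype=float)
    a = a / np.linalg.norm(a)
    k = np.arange(1, a.size + 1)
    angles = np.outer(theta, k)  # shape (n, K)
    return np.concatenate([a * np.cos(angles), a * np.sin(angles)], axis=1)

def polyline_length(points):
    """Sum of Euclidean distances between consecutive rows."""
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

weights = [1.0, 0.5, 0.25]  # assumed harmonic amplitudes
a = np.asarray(weights) / np.linalg.norm(weights)
k = np.arange(1, a.size + 1)
speed = float(np.sqrt(np.sum(a**2 * k**2)))  # constant speed of the embedded curve

# Two pairs of angles with the same gap (0.5): cosine similarities should agree.
u = embed_circle([0.3, 0.8, 2.0, 2.5], weights)
cos_first, cos_second = float(u[0] @ u[1]), float(u[2] @ u[3])

# A finely sampled feature-space path of length pi/2 and its image.
theta = np.linspace(0.0, np.pi / 2, 2001)
feature_length = float(theta[-1] - theta[0])
representation_length = polyline_length(embed_circle(theta, weights))
recovered_feature_length = representation_length / speed

print(cos_first, cos_second)
print(feature_length, recovered_feature_length)
```

In real embeddings neither the map nor the scale factor is known in advance, so this only illustrates the geometric relationship. It is not a procedure for estimating a manifold from data.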
