Finding Belief Geometries with Sparse Autoencoders

Matthew Levinson

Abstract

Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations that combines sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting with AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applying the pipeline to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together, 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset, the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.
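
To make the primary discriminating test concrete, the sketch below illustrates one way to compute a barycentric predictive advantage: regress each prediction target on the simplex's barycentric coordinates, regress it on each individual latent, and compare the paired $R^2$ values with a Wilcoxon signed-rank test. This is a minimal illustration under simplifying assumptions (in-sample $R^2$, ordinary least squares); the array names `bary`, `latents`, and `targets` are hypothetical, not taken from the paper's code.

```python
"""Minimal sketch of a barycentric predictive-advantage test.
Assumptions: in-sample R^2, ordinary least squares, and the
hypothetical arrays `bary`, `latents`, `targets` described below."""
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def r2_of(X, y):
    # Held-out R^2 would be stricter; in-sample R^2 keeps the sketch short.
    return r2_score(y, LinearRegression().fit(X, y).predict(X))


def barycentric_advantage(bary, latents, targets):
    """bary:    (n_samples, K) barycentric coordinates from the fitted simplex
    latents:    (n_samples, m) activations of the cluster's SAE latents
    targets:    (n_samples, T) per-token prediction targets (e.g. logits)
    Returns paired R^2 arrays and a one-sided Wilcoxon p-value."""
    n_targets = targets.shape[1]
    r2_bary = np.array([r2_of(bary, targets[:, t]) for t in range(n_targets)])
    r2_best = np.array([
        max(r2_of(latents[:, [j]], targets[:, t])
            for j in range(latents.shape[1]))
        for t in range(n_targets)
    ])
    # One-sided test: barycentric coordinates beat the best single latent.
    _, p = wilcoxon(r2_bary, r2_best, alternative="greater")
    return r2_bary, r2_best, p
```

Under this framing, a cluster passes when the barycentric $R^2$ exceeds the best-latent $R^2$ consistently across targets, as in the $p < 10^{-14}$ result reported above.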

Figures (7)

  • Figure 1: Overview of the belief geometry discovery pipeline. Text sequences are processed through Gemma-2-9B, and residual-stream activations at layer 20 are encoded by a GemmaScope JumpReLU SAE. Decoder directions of SAE latents are clustered into candidate latent groups using $k$-subspace clustering (a minimal sketch of this clustering step follows the figure list). Each candidate group is then fit with AANet to test for simplex structure and recover barycentric coordinates. The resulting candidate belief geometries are evaluated using barycentric predictive advantage and causal steering as the primary validation analyses.
  • Figure 2: Cluster 768_596: per-latent centroid positions. Mean barycentric centroid of each of the six latents. The latents partition across the three vertices, consistent with vertex-specialized feature coding.
  • Figure 3: Per-sub-component PCA projections, toy model layer 1. Each panel shows the PCA subspace that best reveals one sub-component's geometry, colored by that component's true discrete output token. Every panel also exhibits clear separation for at least one other sub-component, demonstrating cross-component entanglement.
  • Figure 4: Two representative latents from Cluster 4 (multipartite toy model, layer 1, TopK $K=12$). Each panel shows KDE-smoothed activation density over all five component belief geometries. The assigned component is marked $\star$ ($R^2 = 0.89$). Top: Latent 4 fires near the top vertex of each Mess3 simplex and at the centre of the Tom Quantum disks. Bottom: Latent 26 fires near the base vertices and along the annular rim of the Tom Quantum disks. The two latents tile complementary regions of the same geometry, demonstrating geometry-consistent partitioning of belief-state information.
  • Figure 5: Cluster 512_181: barycentric vs. best-latent $R^2$. Barycentric coordinates (mean $R^2 = 0.612$) outperform the best individual latent (mean $R^2 = 0.539$) for every one of the 50 tokens (Wilcoxon $p < 10^{-15}$).
  • ...and 2 more figures
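
For readers who want the clustering step of Figure 1 in executable form, the following is a minimal sketch of one standard $k$-subspace clustering variant applied to unit-norm SAE decoder directions: alternately refit each cluster's $d$-dimensional basis via SVD and reassign each direction to the subspace that preserves the most of its norm under projection. The subspace dimension `d`, the reseeding rule, and the convergence criterion here are assumptions, not the paper's exact algorithm.

```python
"""Minimal sketch of k-subspace clustering of SAE decoder directions
(the grouping step in Figure 1). One standard alternating variant;
the parameters k, d, and the reseeding rule are assumptions."""
import numpy as np


def k_subspace_cluster(D, k=8, d=3, n_iter=50, seed=0):
    """D: (n_latents, n_model) array of unit-norm decoder rows.
    Returns cluster labels and a list of k orthonormal bases,
    each of shape (d, n_model)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(D))
    for _ in range(n_iter):
        # Refit: each cluster's basis = top-d right singular vectors
        # of its member directions.
        bases = []
        for c in range(k):
            members = D[labels == c]
            if len(members) < d:  # reseed empty or undersized clusters
                members = D[rng.choice(len(D), size=d, replace=False)]
            _, _, Vt = np.linalg.svd(members, full_matrices=False)
            bases.append(Vt[:d])
        # Assign: each direction goes to the subspace that preserves
        # the most of its norm under orthogonal projection.
        scores = np.stack(
            [np.linalg.norm(D @ B.T, axis=1) for B in bases], axis=1)
        new_labels = scores.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, bases
```

In the pipeline described above, each recovered group of decoder directions would then be passed to AANet-style simplex fitting to test for vertex structure.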