The Geometry of Concepts: Sparse Autoencoder Feature Structure

Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark

TL;DR

This work investigates the geometry of sparse autoencoder (SAE) features in large language models at three scales: atom, brain, and galaxy. At the atomic scale, it finds crystal-like parallelogram and trapezoid relations among concepts, which become much cleaner once distractor directions such as word length are projected out with linear discriminant analysis. At the brain scale, it finds spatially localized functional lobes: features that co-occur also cluster together geometrically. At the galaxy scale, the feature point cloud is non-isotropic, with a power-law eigenvalue spectrum whose slope is steepest in middle layers, and the clustering entropy is quantified as a function of layer. The findings advance the interpretability of SAE-derived concept spaces and suggest pathways to improve robustness and safety by understanding latent geometry across model layers.

Abstract

Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The "brain" intermediate-scale structure has significant spatial modularity; for example, math and code features form a "lobe" akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The "galaxy" scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
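The "galaxy"-scale claim, a power law of eigenvalues instead of an isotropic cloud, can be illustrated with a small sketch. This is not the authors' code: the function names `eigenvalue_spectrum` and `power_law_slope` are hypothetical, the data are synthetic Gaussians rather than SAE features, and the slope is estimated with a plain least-squares fit in log-log space, which is only one of several reasonable ways to fit a power law.

```python
import numpy as np

def eigenvalue_spectrum(points):
    """Eigenvalues of the covariance matrix of a point cloud, largest first."""
    centered = points - points.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (points.shape[0] - 1)
    # eigvalsh returns ascending eigenvalues for a symmetric matrix; reverse them.
    return np.linalg.eigvalsh(cov)[::-1]

def power_law_slope(eigvals):
    """Least-squares slope of log(eigenvalue) against log(rank)."""
    ranks = np.arange(1, len(eigvals) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return slope

# Synthetic stand-ins for a feature point cloud (assumption: Gaussian data).
rng = np.random.default_rng(0)
n_points, dim = 4000, 32
# Anisotropic cloud: variance along axis k decays like k^-1.
scales = np.arange(1, dim + 1) ** -0.5
anisotropic = rng.standard_normal((n_points, dim)) * scales
# Isotropic cloud: equal variance along every axis.
isotropic = rng.standard_normal((n_points, dim))

slope_aniso = power_law_slope(eigenvalue_spectrum(anisotropic))
slope_iso = power_law_slope(eigenvalue_spectrum(isotropic))
print(f"anisotropic slope: {slope_aniso:.2f}, isotropic slope: {slope_iso:.2f}")
```

An isotropic cloud gives a nearly flat spectrum (slope close to zero), while the anisotropic cloud gives a clearly negative slope. Repeating the fit for each layer would be one way to compare how steep the spectrum is across layers.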

Paper Structure

This paper contains 12 sections, 9 equations, 12 figures.

Figures (12)

  • Figure S1: Parallelogram and trapezoid structure is revealed (left) when distractor dimensions are projected out from the activations using LDA. LDA results in tighter clusters of pairwise Gemma-2-2b activation differences (right), where each cluster corresponds to a different semantic transformation. Distractor features are defined as those that are not related to the semantics of the text; for instance, the first principal component of Gemma-2-2b's Layer 0 activations (top left figure on the right panel) represents word length. Parallelogram or trapezoid structures suggest that there is a unique direction in the activation space that represents each semantic transformation.
  • Figure S2: Features in the SAE point cloud that tend to fire together within documents are seen to also be geometrically co-located in functional "lobes", here down-projected to 2D with t-SNE, with point size proportional to feature frequency. A 2-lobe partition (left) is seen to break the point cloud into roughly equal parts, active on code/math documents and English language documents, respectively. A 3-lobe partition (right) is seen to mainly subdivide the English lobe into a part for short messages and dialogue (e.g., chat rooms and parliament proceedings) and one primarily containing long-form scientific papers.
  • Figure S3: Comparison of the lobe partitions of the SAE point cloud discovered with different affinity measures, with the same t-SNE projection as Figure S2. In the top left, we show clusters computed from geometry, using the cosine similarity between features as the affinity score for spectral clustering. All other panels are based on whether SAE features co-occur (fire together) within 256-token blocks, using different measures of affinity. Although the phi coefficient predicts spatial structure best, all co-occurrence measures are seen to discover the code/math lobe.
  • Figure S4: (top left): Adjusted mutual information between spatial clusters and functional (co-occurrence-based) clusters. (top right): logistic regression balanced test accuracy, predicting co-occurrence-based cluster label from position. (bottom left): Adjusted mutual information with randomly permuted cosine similarity-based clustering labels. (bottom right): balanced test accuracy with random unit-norm feature vectors. The statistical significance reported is for phi-based clustering into lobes.
  • Figure S5: Fraction of contexts in which each lobe had the highest proportion of activating features. For each document type, these fractions sum to 1 across the lobes. We see that lobe 2 typically disproportionately activates on code and math documents. Lobes 0 and 1 activate on other documents, with lobe 0 activating more on documents containing short text and dialogue (chat comments, parliamentary proceedings) and lobe 1 activating more on scientific papers.
  • ...and 7 more figures
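The captions for Figures S3 and S4 describe grouping features by co-occurrence, with the phi coefficient as one affinity measure fed to spectral clustering. The sketch below shows the idea on synthetic data and is not the authors' pipeline. The names `phi_coefficient_matrix` and `two_way_spectral_partition` are hypothetical, the firing data are simulated, the shift `(phi + 1) / 2` that makes the affinities non-negative is an assumption made for this example, and the clustering is a simplified two-way split on the sign of the Fiedler vector.

```python
import numpy as np

def phi_coefficient_matrix(fired):
    """Pairwise phi coefficients from a (n_blocks, n_features) boolean firing matrix."""
    x = fired.astype(float)
    n = x.shape[0]
    n1 = x.sum(axis=0)            # number of blocks in which each feature fires
    n11 = x.T @ x                 # blocks in which both features fire
    n10 = n1[:, None] - n11       # feature i fires, feature j does not
    n01 = n1[None, :] - n11       # feature j fires, feature i does not
    n00 = n - n11 - n10 - n01     # neither fires
    marginals = n1 * (n - n1)
    denom = np.sqrt(np.outer(marginals, marginals))
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = (n11 * n00 - n10 * n01) / denom
    # Features that never or always fire have an undefined phi; set those to 0.
    return np.nan_to_num(phi)

def two_way_spectral_partition(affinity):
    """Split items into two groups by the sign of the Fiedler vector."""
    a = np.array(affinity, dtype=float, copy=True)
    np.fill_diagonal(a, 0.0)
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
    lap = np.eye(len(a)) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)
    fiedler = vecs[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)

# Synthetic firing data (assumption): two block types and two feature groups,
# where each feature fires mostly in blocks of its own type.
rng = np.random.default_rng(0)
n_blocks, per_group = 500, 10
block_type = rng.integers(0, 2, size=n_blocks)
feature_group = np.repeat([0, 1], per_group)
fire_prob = np.where(block_type[:, None] == feature_group[None, :], 0.6, 0.05)
fired = rng.random((n_blocks, 2 * per_group)) < fire_prob

phi = phi_coefficient_matrix(fired)
labels = two_way_spectral_partition((phi + 1.0) / 2.0)
print(labels)
```

For more than two lobes, a full spectral clustering routine that accepts a precomputed affinity matrix would replace the sign split. Comparing the resulting functional labels with geometry-based labels is what a score such as adjusted mutual information measures.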