The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark
TL;DR
This work investigates the geometry of sparse autoencoder (SAE) features in large language models at three scales: atom, brain, and galaxy. At the atomic scale, it finds crystal-like parallelogram and trapezoid relations among concepts, which become much cleaner once global distractor directions such as word length are projected out with linear discriminant analysis. At the brain scale, features that co-occur also cluster together spatially, forming localized functional "lobes" such as one for math and code. At the galaxy scale, the feature point cloud is not isotropic: its eigenvalues follow a power law whose slope is steepest in the middle layers, and the clustering entropy varies with the layer. The findings advance interpretability of SAE-derived concept spaces and suggest that understanding latent geometry across model layers could help improve robustness and safety.
Abstract
Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The "brain" intermediate-scale structure has significant spatial modularity; for example, math and code features form a "lobe" akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The "galaxy" large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
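To make the "galaxy"-scale claim concrete, the following is a minimal sketch of how one might check whether a point cloud is isotropic by examining the eigenvalues of its covariance matrix and fitting a slope in log-log space. It is an illustration under stated assumptions, not the authors' procedure: the function names `eigenvalue_spectrum` and `power_law_slope` are hypothetical, the data are synthetic Gaussian clouds rather than SAE feature vectors, and the fit is a plain least-squares line over all ranks.

```python
import numpy as np


def eigenvalue_spectrum(points):
    """Covariance eigenvalues of a point cloud, sorted in descending order."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / (len(points) - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    # Floor tiny or negative round-off values so the logarithm stays finite.
    return np.maximum(eigvals, 1e-12)


def power_law_slope(eigvals):
    """Least-squares slope of log(eigenvalue) against log(rank)."""
    ranks = np.arange(1, len(eigvals) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return slope


# Synthetic stand-ins for a feature point cloud (assumption: NOT real SAE data).
rng = np.random.default_rng(0)
n_points, dim = 20000, 64

# Anisotropic cloud: the variance along axis i decays as i^(-1.5).
std_per_axis = np.arange(1, dim + 1) ** -0.75
anisotropic = rng.standard_normal((n_points, dim)) * std_per_axis

# Isotropic cloud: equal variance along every axis.
isotropic = rng.standard_normal((n_points, dim))

slope_aniso = power_law_slope(eigenvalue_spectrum(anisotropic))
slope_iso = power_law_slope(eigenvalue_spectrum(isotropic))

print(f"anisotropic slope: {slope_aniso:.2f}")
print(f"isotropic slope:   {slope_iso:.2f}")
```

An isotropic cloud gives a nearly flat spectrum (slope close to zero), while the synthetic anisotropic cloud recovers a slope close to the exponent used to generate it. Applying the same measurement to the decoder vectors of each layer would be one way to compare slopes across layers, though the fitting range and normalization used in the paper may differ.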
