Learning Multi-Level Features with Matryoshka Sparse Autoencoders

Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda

TL;DR

The paper tackles the interpretability bottleneck of sparse autoencoders by addressing scaling-induced distortions such as feature absorption and splitting. It introduces Matryoshka SAEs, which train nested dictionaries of increasing size to enforce multi-level reconstructions, preserving high-level concepts while enabling finer granularity. Across toy models, TinyStories, and Gemma-2-2B, Matryoshka SAEs show improved disentanglement, reduced feature absorption, and stronger sparse probing and concept removal capabilities, with scalable performance as dictionary size grows. While reconstruction fidelity is slightly traded off, the gains in interpretability and downstream task reliability suggest Matryoshka SAEs offer a practical path for scalable, mechanistic interpretability of large neural systems.

Abstract

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e. number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically - the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
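The nested-dictionary idea in the abstract can be illustrated with a minimal sketch: each prefix of the latents has to reconstruct the input by itself, and the per-prefix errors are summed. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names, the ReLU encoder, the squared-error loss, and the unweighted sum over prefixes are all assumptions made here, and the sparsity mechanism is left out entirely.

```python
# Hypothetical sketch of a nested ("Matryoshka") reconstruction loss.
# All names and the exact loss form are assumptions, not the paper's code.
# The sparsity mechanism of a real SAE is deliberately omitted.

def encode(x, w_enc, b_enc):
    """Compute latents f = ReLU(W_enc x + b_enc); w_enc has one row per latent."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def decode_prefix(latents, w_dec, b_dec, m):
    """Reconstruct the input from the first m latents only (one nested dictionary)."""
    recon = list(b_dec)
    for j in range(m):
        for d in range(len(recon)):
            recon[d] += latents[j] * w_dec[j][d]
    return recon


def matryoshka_loss(x, w_enc, b_enc, w_dec, b_dec, prefix_sizes):
    """Sum of squared reconstruction errors, one term per nested prefix size."""
    latents = encode(x, w_enc, b_enc)
    per_prefix = []
    for m in prefix_sizes:
        recon = decode_prefix(latents, w_dec, b_dec, m)
        per_prefix.append(sum((a - b) ** 2 for a, b in zip(x, recon)))
    return sum(per_prefix), per_prefix


if __name__ == "__main__":
    # Toy weights chosen by hand: 2 input dimensions, 3 latents.
    w_enc = [[1, 0], [0, 1], [1, 1]]
    b_enc = [0.0, 0.0, 0.0]
    w_dec = [[1, 0], [0, 1], [0, 0]]
    b_dec = [0.0, 0.0]
    total, per_prefix = matryoshka_loss(
        [1.0, 2.0], w_enc, b_enc, w_dec, b_dec, prefix_sizes=[1, 2, 3]
    )
    print(total, per_prefix)
```

Because the smallest prefix contributes its own error term, its latents are pushed to explain as much of the input as possible without help from later latents. This is the pressure the abstract describes for keeping general concepts in the smaller dictionaries.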

Paper Structure

This paper contains 41 sections, 7 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Architecture and Performance of Matryoshka SAEs. (Left) The model learns multiple nested reconstructions simultaneously, with each reconstruction using only a subset of the total latents. This creates pressure for early latents to capture general features while later latents can specialize in more specific concepts. (Right) Comparative metrics between SAEs with average sparsity of 40 active latents per token (L0=40) showing that while Matryoshka SAEs sacrifice a small amount of reconstruction fidelity (higher variance unexplained), they achieve significantly lower feature absorption rates and less feature composition (lower decoder cosine similarity).
  • Figure 2: Toy model decoder vector similarity. Graphical representation of the toy model (left). The heatmaps show the cosine similarity between learned latent vectors and ground-truth feature vectors for the Matryoshka SAE (middle) and Vanilla SAE (right). The Matryoshka SAE shows a clear diagonal structure, which demonstrates disentanglement of the hierarchical features and learning the ground truth. The Vanilla SAE, however, exhibits high similarity between parent and child latents, indicating feature absorption.
  • Figure 3: Toy model activations. Ground-truth feature activations alongside Matryoshka SAE activations and Vanilla SAE activations for one of the parents and its children on the toy model. Notice that in the Vanilla SAE the parent latent (column bracketed in blue) does not fire when its children (bracketed in red) fire. For the activations of all features in the toy model, see the full toy model activations figure (label fig:full-toy-model-acts).
  • Figure 4: Feature absorption in a TinyStories model. Example activations showing how a general "female words" latent (S/2/65) develops holes in a larger SAE (S/3/66) as specialized latents for "Lily" (S/3/359) and "Sue" (S/3/861) absorb specific cases. Feature absorption locations are circled in red.
  • Figure 5: Reconstruction performance. The variance explained by Matryoshka SAEs is slightly worse than that of some competing architectures. However, on downstream LLM CE loss, Matryoshka SAEs perform comparably, especially at larger L0s.
  • ...and 14 more figures
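
Two quantities named in the captions above, the cosine similarity between learned latent vectors and ground-truth feature vectors (Figure 2) and the variance left unexplained by a reconstruction (Figures 1 and 5), can be sketched as follows. The function names are hypothetical, and the definition of fraction of variance unexplained used here (residual sum of squares divided by total sum of squares around the per-dimension mean) is a common convention assumed for illustration. The paper's exact definition may differ.

```python
# Hypothetical helpers for two metrics mentioned in the figure captions.
# Assumes non-zero vectors and inputs that are not all identical.
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(latent_vecs, feature_vecs):
    """Rows index learned latents, columns index ground-truth features."""
    return [[cosine_similarity(l, f) for f in feature_vecs] for l in latent_vecs]


def fraction_variance_unexplained(inputs, recons):
    """Residual sum of squares over total sum of squares (assumed definition)."""
    n, dim = len(inputs), len(inputs[0])
    mean = [sum(x[d] for x in inputs) / n for d in range(dim)]
    residual = sum((x[d] - r[d]) ** 2
                   for x, r in zip(inputs, recons) for d in range(dim))
    total = sum((x[d] - mean[d]) ** 2 for x in inputs for d in range(dim))
    return residual / total
```

A similarity matrix close to the identity corresponds to the clear diagonal structure described for the Matryoshka SAE in Figure 2. Large off-diagonal entries between a parent latent and its children's features would correspond to the absorption pattern described for the Vanilla SAE.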