Sparse Autoencoders Do Not Find Canonical Units of Analysis

Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda

TL;DR

The paper questions whether sparse autoencoders (SAEs) can identify a canonical, atomic set of features for mechanistic interpretability. It introduces SAE stitching to classify latents from larger SAEs as novel or reconstruction-related, demonstrating that larger SAEs add new information beyond what smaller SAEs capture. It also introduces meta-SAEs to decompose decoder directions into interpretable meta-latents, revealing that larger SAE latents are often mixtures of smaller features and that meta-latents explain a substantial portion of variance in decoder directions (e.g., 55.47%). Collectively, the findings argue against a universal canonical unit of analysis and suggest a pragmatic, multi-width approach or alternative methods for identifying fundamental units. The work provides an interactive dashboard to explore meta-SAEs and emphasizes that interpretability tasks may require context-specific feature dictionaries rather than a single optimal SAE size.

Abstract

A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel latents indicates incompleteness of smaller SAEs. Using meta-SAEs (SAEs trained on the decoder matrix of another SAE), we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
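
The meta-SAE idea in the abstract can be written down as a rough notational sketch. The symbols here are assumptions chosen for illustration and are not taken from the paper: d_i stands for the decoder direction of latent i in the original SAE, m_j for the decoder direction of meta-latent j, and a_{ij} for the activation of meta-latent j on d_i. The exact architecture and any bias terms are left out.

```latex
% Assumed notation (not from the paper):
%   \mathbf{d}_i : decoder direction of latent i in the original SAE
%   \mathbf{m}_j : decoder direction of meta-latent j
%   a_{ij}       : activation of meta-latent j on \mathbf{d}_i, nonzero for only a few j
\mathbf{d}_i \;\approx\; \sum_{j} a_{ij}\, \mathbf{m}_j

% The abstract's example, in the same assumed notation:
\mathbf{d}_{\text{Einstein}} \;\approx\;
    a_1\, \mathbf{m}_{\text{scientist}}
  + a_2\, \mathbf{m}_{\text{Germany}}
  + a_3\, \mathbf{m}_{\text{famous person}}
  + \dots
```

Read this way, a latent counts as "not atomic" when its decoder direction is well approximated by a sparse combination of this kind.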

Paper Structure

This paper contains 39 sections, 10 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: Decomposition of an SAE latent representing "Einstein" into a set of interpretable meta-latents. The edges connecting the nodes indicate shared activation by a meta-latent, with thicker lines representing stronger connections. It demonstrates the ability of meta-SAEs to uncover the underlying compositional structure of SAE latents, revealing how a complex concept can be represented as a sparse combination of meta-latents. We built a dashboard where you can explore all meta-latents: https://metasaes.streamlit.app
  • Figure 2: Example of composition of latents in SAEs of different sizes. The smaller SAE has six latents, three of which reconstruct shape features, and three of which reconstruct color features. Reconstructing a shape of a specific color requires two active latents (e.g. blue and square). On the other hand, the larger SAE has nine latents, each of which reconstructs a different color and shape combination. In the larger SAE, only a single active latent is required to reconstruct the colored shape (e.g. blue square). The sparsity penalty incentivizes larger SAEs to learn compositions of latents rather than atomic latents.
  • Figure 3: SAE stitching operation: connected subgraphs of latents can be swapped between SAEs based on cosine similarity.
  • Figure 4: Change in MSE when adding each feature from GPT2-1536 to GPT2-768, plotted against the maximum cosine similarity of that feature to any feature in GPT2-768. Features with cosine similarity less than 0.7 tend to improve MSE, while more redundant features hurt performance. A few extreme outliers with very high cosine similarity and effect on MSE are not visible in this plot.
  • Figure 5: Interpolating between SAE pairs of increasing dictionary size (768→1536→3072→6144→12288) through two steps per phase: adding novel latents (increasing L0) then swapping groups of reconstruction latents (decreasing L0 on average). Both steps on average improve reconstruction (MSE). The L0 and MSE are averages over input samples.
  • ...and 18 more figures
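
The quantities in the Figure 4 caption (maximum cosine similarity to the smaller SAE, and the change in MSE when a latent is added) can be illustrated with a minimal toy sketch. Everything in it is hypothetical: the 2-D input, the decoder directions, and the fixed activations are made up for illustration. In a real SAE the activations come from the encoder and the MSE is averaged over many inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reconstruct(latents):
    """Sum activation * decoder direction over (activation, direction) pairs."""
    dim = len(latents[0][1])
    return [sum(act * d[i] for act, d in latents) for i in range(dim)]


def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Hypothetical toy data: one 2-D input and a "small SAE" with a single latent.
x = [1.0, 1.0]
small = [(1.0, [1.0, 0.0])]

# Two candidate latents from a hypothetical "larger SAE", with fixed activations.
candidates = {
    "dissimilar": (1.0, [0.0, 1.0]),
    "redundant": (1.0, [1.0, 0.0]),
}

base_mse = mse(x, reconstruct(small))
results = {}
for name, (act, direction) in candidates.items():
    # Maximum cosine similarity of the candidate to any latent in the small SAE.
    max_cos = max(cosine(direction, d) for _, d in small)
    # Change in MSE when the candidate is added on top of the small SAE.
    delta = mse(x, reconstruct(small + [(act, direction)])) - base_mse
    results[name] = (max_cos, delta)
    print(f"{name}: max cosine = {max_cos:.2f}, change in MSE = {delta:+.2f}")
```

In this toy setting the dissimilar latent lowers the MSE by 0.5, while the redundant one raises it by 0.5 because it double-counts a direction the small SAE already reconstructs. That matches the qualitative pattern the caption describes, though the 0.7 cosine similarity figure in the caption is a tendency observed in the plot, not a rule this sketch implements.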