Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

TL;DR

This paper systematically evaluates sparse autoencoder (SAE) probes on LLM activations across 113 binary datasets and four challenging regimes: data scarcity, class imbalance, label noise, and covariate shift. Using rigorous baselines and the Quiver of Arrows robustness check, the study finds that SAE probes rarely outperform baseline probes and that the interpretability insights they offer can largely be matched by simple non-SAE baselines. Latent analyses reveal both semantically meaningful and spurious features, and SAEs can flag dataset quality issues such as mislabeled examples in CoLA. The work highlights the need for stringent baseline controls and cautious interpretation of SAE-based explanations in mechanistic interpretability research.

Abstract

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs' basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods combining SAEs with baselines that consistently outperform ensemble methods solely using baselines. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, we are able to achieve similar results with simple non-SAE baselines as well. Though we cannot discount SAEs' utility on other tasks, our findings highlight the shortcomings of current SAEs and the need to rigorously evaluate interpretability methods on downstream tasks with strong baselines.

Paper Structure

This paper contains 48 sections, 2 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: SAE probes underperform the baseline of logistic regression in each regime when taking the mean across datasets. Additionally, we find that baseline methods can provide many of the interpretability insights of SAE probes.
  • Figure 2: Left: An illustration of our SAE probing method. We pass in training activation vectors from each class and train an $L_1$ regularized logistic regression probe on the latents that differ the most between classes. Right: We ensure robustness of our results with the "quiver of arrows" approach (see the quiver of arrows section): we add SAE regression into a set of methods, and see if the test accuracy of the best method (chosen by validation accuracy) increases.
  • Figure 3: For a given width, using higher L0 and constructing probes with a larger basis of latents ($k$) is more performant.
  • Figure 4: In standard conditions, when SAE probes are added to the quiver, we find a slight decrease in performance.
  • Figure 5: For three of the datasets in the example datasets table, we visualize the performance when SAE probes are in the quiver (dashed) versus when they are not (solid) for the regimes of data scarcity (left), class imbalance (middle), and label noise (right). In all three regimes, we see that on average (bottom row), SAEs do not help.
  • ...and 17 more figures
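
The two procedures described in the Figure 2 caption (probing on the SAE latents that differ most between classes, and the quiver of arrows check) can be sketched roughly as below. This is a minimal illustration, not the authors' code. The function names (`top_k_latents`, `fit_l1_logreg`, `accuracy`, `quiver_test_accuracy`) are hypothetical, ranking latents by absolute difference of class means is an assumed reading of "differ the most", and the proximal-gradient solver with its hyperparameters is an assumed stand-in for whatever $L_1$ logistic regression solver the paper uses.

```python
import numpy as np


def top_k_latents(latents, labels, k):
    """Return indices of the k latents whose class means differ the most.

    Assumption: "differ the most" is measured as the absolute difference
    between the mean activation on class 1 and on class 0.
    """
    diff = np.abs(
        latents[labels == 1].mean(axis=0) - latents[labels == 0].mean(axis=0)
    )
    return np.argsort(diff)[::-1][:k]


def fit_l1_logreg(X, y, l1=0.01, lr=0.5, steps=500):
    """L1-regularized logistic regression via proximal gradient descent.

    Assumed solver and hyperparameters, chosen only to keep the sketch
    dependency-free. The bias term is not regularized.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        w = w - lr * (X.T @ (p - y) / n)
        # Soft-thresholding is the proximal step for the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
        b = b - lr * float(np.mean(p - y))
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def quiver_test_accuracy(val_acc, test_acc, methods):
    """Quiver of arrows: pick the method with the best validation accuracy
    among `methods`, then report that method's test accuracy."""
    best = max(methods, key=lambda m: val_acc[m])
    return best, test_acc[best]
```

To mirror the comparison in the caption, one would call `quiver_test_accuracy` twice, once with only the baseline methods and once with the SAE probe added to the list, and compare the two reported test accuracies. Because selection uses validation accuracy, adding a method can lower the reported test accuracy as well as raise it.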