Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov; George Lange; Neel Nanda

TL;DR

This work addresses the lack of ground truth when evaluating sparse dictionary learning for interpretability in realistic models. It introduces an evaluation framework that uses supervised feature dictionaries as benchmarks. Applying the framework to the IOI task with GPT-2 Small, the authors show that supervised dictionaries enable near-faithful reconstruction, precise attribute editing, and interpretable features, while unsupervised SAEs offer interpretability but limited control. They also identify two qualitative SAE phenomena, feature occlusion and feature over-splitting, and demonstrate both in toy models, underscoring the need for principled training and objective evaluation. The study offers a concrete path toward grounded assessments of sparse dictionary learning methods in large language models and highlights directions for improving SAE-based control and interpretability.

Abstract

Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against supervised feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.

Paper Structure (43 sections, 1 theorem, 36 equations, 31 figures)

Introduction
Preliminaries
Overview and Motivation
Test 1: Sufficiency and Necessity of Dictionary Reconstructions for the Task
Test 2: Sparse Controllability of Attributes
Test 3: Interpretability
Computing and Validating Supervised Feature Dictionaries
Computing Supervised Feature Dictionaries
Evaluation Results
Evaluating Task-Specific and Full-Distribution Sparse Autoencoders
Methodology
Results
Qualitative Phenomena in SAE Learning
Feature Occlusion
Feature Over-splitting
...and 28 more sections

Key Result

Lemma A.1

Suppose that all conditional means $\mathbb{E}_{p\sim \mathcal{D}}\left[\mathbf{a}| a_i(p) = v\right]$ exist for all $i, v\in S_i$. Let $a_i$ be an attribute such that its values appear independently from the values of all other attributes, i.e. Then, in the limit of infinite training data, the conditional means $\mathbb{E}\left[\mathbf{a}| a_i(p) = v\right]$ are all equal to the overall mean $\mathbb

Figures (31)

Figure 1: Overview of our evaluation pipeline. We begin by selecting a specific model capability and then disentangling model activations into capability-relevant features using supervision. Then, we evaluate a given feature dictionary w.r.t. this capability, using the supervised features as a benchmark. We test the extent to which (1) the feature dictionary's reconstructions of the activations are necessary and sufficient for the capability, (2) the features can be used to edit capability-relevant information in internal model representations (agnostic of feature interpretations), and (3) the features can be interpreted w.r.t. the capability in a manner consistent with their causal role.
Figure 2: Sufficiency (left) and necessity (right) evaluations of reconstructions of cross-sections of the IOI circuit computed using supervised feature dictionaries, task- and full-distribution SAEs. Left: average logit difference when replacing activations in cross-sections of the IOI circuit with their reconstructions, normalized by the average logit difference over the data distribution in the absence of intervention (a $y$-axis value of $1$ is best). Right: drop in logit difference when deleting reconstructions, normalized by the respective drop when performing mean-ablation, and linearly rescaled so that values close to 1 are best. See Appendix \ref{app:ioi-supervised-details} for details.
Figure 3: Accuracy when editing IO, S and Pos for circuit cross-sections using our supervised feature dictionaries and task-specific SAEs; the outcome in the absence of intervention is shown in blue for reference. When using task-specific SAEs, we edit either 2, 4 or 6 features (which means we in total add and/or remove up to that many features from activations). For comparison, supervised edits always involve removing 1 feature and adding 1 feature. Accuracy is measured as the proportion of examples for which the model's prediction agrees with the ground-truth prediction for the edit; see Section \ref{subsection:sae-evaluation-methodology} and Appendix \ref{app:ioi-supervised-details} for details.
Figure 4: Trade-offs between edit magnitude and edit success for attribute editing using task-specific SAEs for select IOI circuit cross-sections. The x-axis measures the weight (see Subsection \ref{subsection:sae-evaluation-methodology}) of the features removed by the edit (features added are not reported in this plot), averaged over the attention heads in the cross-section. This metric is affine-transformed so that a value of 0 indicates the weight removed by the corresponding supervised edit, and a value of 1 indicates that the edit removed all features in the reconstruction. The y-axis is an affine transform of the fraction of examples for which the edit results in the same next-token prediction as the ground-truth edit, with a value of $0$ corresponding to no intervention, and a value of $1$ corresponding to the supervised edits. Results for our interpretation-agnostic/interpretation-aware editing methods are shown as thick/dashed lines respectively. For both methods, we edit 2, 4 or 6 features (a higher magnitude score indicates editing more features).
Figure 5: Interpreting the IOI features learned by SAEs trained on OpenWebText. For each node in the IOI circuit, we show the distribution of interpretations for the features which have any interpretation with $F_1$ score above a threshold. The numbers in the right column indicate the number of features with an assigned interpretation by our method, and the color bars show the overall distribution of the SAE features (conditioned on the feature not being dead on the SAE training distribution). See Section \ref{section:sae-evaluation} for methodology; details on the interpretations considered are given in Appendix \ref{app:sae-interp-methodology}.
...and 26 more figures
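
The normalizations described in the captions of Figures 2, 4 and 5 can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`normalized_logit_diff`, `affine_rescale`, `f1_score`) and the toy inputs are hypothetical, and treating the $F_1$ score as agreement between a binarized feature activation and a binary attribute label is an assumption, since the captions do not spell out the exact procedure.

```python
from statistics import mean


def normalized_logit_diff(intervened, clean):
    """Sufficiency score in the spirit of Figure 2 (left).

    Mean logit difference with activations replaced by reconstructions,
    divided by the mean logit difference without intervention.
    A value of 1 means the reconstruction preserves task performance.
    """
    return mean(intervened) / mean(clean)


def affine_rescale(value, zero_point, one_point):
    """Affine map sending `zero_point` to 0 and `one_point` to 1.

    Figure 4 applies a transform of this kind to both axes, e.g. for the
    y-axis: zero_point = accuracy with no intervention,
    one_point = accuracy of the supervised edit.
    """
    if one_point == zero_point:
        raise ValueError("reference points must differ")
    return (value - zero_point) / (one_point - zero_point)


def f1_score(feature_active, attribute_present):
    """F1 of a binary feature activation as a predictor of a binary attribute.

    ASSUMPTION: this is one plausible reading of the F_1 score in Figure 5.
    """
    pairs = list(zip(feature_active, attribute_present))
    tp = sum(1 for f, a in pairs if f and a)
    fp = sum(1 for f, a in pairs if f and not a)
    fn = sum(1 for f, a in pairs if not f and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, an SAE edit whose accuracy lies halfway between the no-intervention baseline and the supervised edit would receive a rescaled y-value of about 0.5 in a plot like Figure 4.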

Theorems & Definitions (2)

Lemma A.1
proof