Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Anton Korznikov; Andrey Galichin; Alexey Dontsov; Oleg Rogov; Ivan Oseledets; Elena Tutubalina

TL;DR

This paper shows that SAEs in their current state do not reliably decompose models' internal mechanisms: they fail at their core task of recovering true features even when reconstruction is strong.

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.

Paper Structure (29 sections, 9 equations, 18 figures, 4 tables)

Introduction
Background
Sparse Autoencoders for Interpretability
Model Architecture and Decomposition
Training Objective
SAE Architectural Variants
Critical Perspectives on SAEs
Case Study #1: Toy Model Experiments
Experimental setup
Synthetic Data Generation
Results
Case Study #2: Validating SAEs on LLMs
Experimental setup
Results
Limitations
...and 14 more sections

Figures (18)

Figure 1: Frozen SAE baselines and their performance. (Left) Conceptual diagrams: Frozen Decoder SAE (decoder weights fixed at random initialization), Soft-Frozen Decoder SAE (decoder weights initialized randomly and constrained to maintain CosineSim $\geq$ 0.8 with their initial values throughout training), and Frozen Encoder SAE (encoder weights fixed at random initialization). (Right) For the BatchTopK SAE (L0=160), these baselines remain competitive with the fully trained SAE across four key evaluation metrics, challenging the assumption that strong performance indicates meaningful feature learning.
Figure 2: SAE performance in the constant-probability setting. Both BatchTopK and JumpReLU SAEs achieve high reconstruction fidelity (explained variance = 0.67), yet recover almost none of the ground-truth features in this simplest aligned setting.
Figure 3: SAE performance in the variable-probability setting. Both SAE architectures achieve high reconstruction fidelity (explained variance = 0.71), yet recover only the highest-frequency ground-truth features.
Figure 4: Explained Variance. Despite training with frozen components, naive baselines achieve high reconstruction performance, with Soft-Frozen SAEs matching fully-trained ReLU SAEs and losing only 6% relative to their original variants.
Figure 5: AutoInterp score distribution. For both SAE architectures, frozen baselines achieve high AutoInterp scores, with the Soft-Frozen variant matching original performance, suggesting interpretability can emerge without learned feature alignment.
...and 13 more figures
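
The Soft-Frozen Decoder constraint described in the Figure 1 caption (each decoder direction keeps cosine similarity of at least 0.8 with its random initialization) can be illustrated with a projection step. The following is a minimal sketch, not the authors' implementation: the function name `project_to_cosine_cone`, the choice to apply a projection after each optimizer step, and the unit-normalization of decoder rows are all assumptions made here for illustration.

```python
import numpy as np

def project_to_cosine_cone(W, W0, min_cos=0.8, eps=1e-12):
    """Hypothetical helper: keep each row of W within a cosine cone around W0.

    W  : (n_features, d_model) current decoder directions
    W0 : (n_features, d_model) decoder directions at random initialization
    Returns unit-norm rows whose cosine similarity with the matching row
    of W0 is at least `min_cos` (assumption: decoder rows are unit-normalized).
    """
    # Normalize both sets of directions to unit length.
    W0n = W0 / np.maximum(np.linalg.norm(W0, axis=1, keepdims=True), eps)
    Wn = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), eps)

    # Cosine similarity of each current direction with its initial direction.
    cos = np.sum(Wn * W0n, axis=1, keepdims=True)

    # Unit vector of the component of Wn orthogonal to W0n.
    ortho = Wn - cos * W0n
    u = ortho / np.maximum(np.linalg.norm(ortho, axis=1, keepdims=True), eps)

    # Closest direction on the cone boundary: cosine exactly min_cos with W0n.
    boundary = min_cos * W0n + np.sqrt(1.0 - min_cos**2) * u
    boundary /= np.maximum(np.linalg.norm(boundary, axis=1, keepdims=True), eps)

    # Only rows that violate the constraint are moved to the boundary.
    return np.where(cos < min_cos, boundary, Wn)


# Usage: random directions in 32 dimensions are nearly orthogonal to their
# initial values, so every row is pulled back onto the cone boundary.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 32))
W = rng.normal(size=(16, 32))
W_proj = project_to_cosine_cone(W, W0, min_cos=0.8)
```

Under this sketch, a direction that already satisfies the constraint is only renormalized, while a violating direction keeps its orthogonal component but is rotated toward its initialization until the cosine similarity equals the threshold.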
