Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
TL;DR
This work examines sparse autoencoders (SAEs) used for interpretability by contrasting standard shallow SAEs with MP-SAE, an SAE that unrolls Matching Pursuit. It shows that one-shot inference in shallow SAEs biases their dictionaries toward near-orthogonality, which hinders the discovery of correlated features, whereas MP-SAE selects atoms sequentially, guided by the residual, and builds richer, hierarchical representations whose reconstruction improves monotonically. Across MNIST and large vision-model backbones, MP-SAE is more expressive, reveals coherent global structure alongside locally diverse atom selections, and supports progressive coarse-to-fine reconstruction. These findings point to a principled route toward more interpretable and robust sparse representations of neural activations.
Abstract
Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits their ability to extract correlated features. To move beyond this, we compare shallow SAEs with an iterative SAE that unrolls Matching Pursuit (MP-SAE). MP-SAE enables the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation, while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
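To make the residual-guided selection and the monotonicity guarantee concrete, here is a minimal NumPy sketch of classical Matching Pursuit on a fixed random dictionary. It is not the authors' trained MP-SAE: the names `matching_pursuit` and `make_dictionary`, the unit-norm random atoms, and the choice to select by absolute correlation are all assumptions made for illustration. With unit-norm atoms, each step removes the projection of the residual onto the chosen atom, so the squared residual norm drops by exactly the squared correlation and can never increase.

```python
import numpy as np

def make_dictionary(k, d, seed=0):
    """Hypothetical helper: k random atoms in R^d, each scaled to unit norm."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(k, d))
    return D / np.linalg.norm(D, axis=1, keepdims=True)

def matching_pursuit(x, D, n_steps):
    """Greedy Matching Pursuit for a single input.

    x: input vector of shape (d,)
    D: dictionary of shape (k, d) with unit-norm rows (assumed)
    Returns the sparse code z (k,), the reconstruction (d,), and the
    residual norm before the first step and after every step.
    """
    residual = x.astype(float).copy()
    z = np.zeros(D.shape[0])
    norms = [float(np.linalg.norm(residual))]
    for _ in range(n_steps):
        corr = D @ residual                    # correlation of every atom with the residual
        j = int(np.argmax(np.abs(corr)))       # pick the best-matching atom (assumed: by magnitude)
        z[j] += corr[j]                        # accumulate its coefficient
        residual = residual - corr[j] * D[j]   # remove that atom's contribution
        norms.append(float(np.linalg.norm(residual)))
    return z, x - residual, norms

D = make_dictionary(k=32, d=16)
x = np.random.default_rng(1).normal(size=16)
z, x_hat, norms = matching_pursuit(x, D, n_steps=8)
```

Here `norms` is non-increasing by construction, and `x_hat` equals `z @ D`, so truncating the loop after fewer steps yields a coarser reconstruction built from the same leading atoms. A shallow SAE, by contrast, computes all coefficients in a single pass and never conditions a later selection on what earlier atoms have already explained.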
