Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba

TL;DR

This work interrogates interpretability-focused sparse autoencoders by contrasting standard shallow SAEs with an unrolled Matching Pursuit-based SAE (MP-SAE). It demonstrates that one-shot inference in shallow SAEs biases dictionaries toward near-orthogonality, hindering correlated-feature discovery, while MP-SAE leverages residual-guided, sequential atom selection to build richer, hierarchical representations with monotonic reconstruction improvement. Across MNIST and large vision-model backbones, MP-SAE delivers higher expressivity, reveals coherent global structure alongside locally diverse atom selections, and supports progressive coarse-to-fine reconstruction. These findings offer a principled route to more interpretable, robust sparse representations in neural systems.
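To make "one-shot inference" concrete, here is a minimal sketch of a shallow SAE forward pass. It assumes a TopK-style sparsity rule and a tied encoder (`W_enc = D.T`); both are illustrative choices, and the names `shallow_topk_encode`, `W_enc`, `b_enc`, and `k` are hypothetical rather than taken from the paper.

```python
import numpy as np

def shallow_topk_encode(x, W_enc, b_enc, D, k):
    """One-shot inference: one linear map + ReLU, then keep the k largest codes.

    Assumption: TopK sparsity; other shallow SAEs use different rules.
    """
    pre = np.maximum(W_enc @ x + b_enc, 0.0)  # all codes computed at once from x
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]                # indices of the k largest activations
    z[top] = pre[top]
    x_hat = D @ z                             # linear decoder: columns of D are atoms
    return z, x_hat

rng = np.random.default_rng(0)
d, p, k = 16, 64, 4                           # input dim, dictionary size, sparsity
D = rng.normal(size=(d, p))
D /= np.linalg.norm(D, axis=0, keepdims=True) # unit-norm atoms
W_enc, b_enc = D.T.copy(), np.zeros(p)        # tied encoder, purely for illustration
x = rng.normal(size=d)
z, x_hat = shallow_topk_encode(x, W_enc, b_enc, D, k)
```

Every code in `z` is computed from `x` in a single pass, so no selection can account for what another atom has already explained.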

Abstract

Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we compare them with an iterative SAE that unrolls Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
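The residual-guided selection described above can be sketched with classical Matching Pursuit. This is a minimal illustration, not the authors' implementation: it assumes unit-norm atoms and selection by absolute correlation, and the function name `mp_encode` and the random dictionary are hypothetical.

```python
import numpy as np

def mp_encode(x, D, n_steps):
    """Greedy Matching Pursuit with unit-norm atoms (columns of D).

    Assumption: atoms are chosen by largest absolute correlation with the residual.
    """
    r = x.copy()                      # residual starts as the input
    z = np.zeros(D.shape[1])
    res_norms = [np.linalg.norm(r)]
    for _ in range(n_steps):
        corr = D.T @ r                # correlation of each atom with the residual
        j = int(np.argmax(np.abs(corr)))
        z[j] += corr[j]               # an atom may be re-selected, so accumulate
        r = r - corr[j] * D[:, j]     # remove the selected atom's contribution
        res_norms.append(np.linalg.norm(r))
    return z, r, res_norms

rng = np.random.default_rng(0)
D = rng.normal(size=(16, 64))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # enforce unit-norm atoms
x = rng.normal(size=16)
z, r, res_norms = mp_encode(x, D, n_steps=8)
```

In this sketch the identity `x = D @ z + r` holds after every step, and `res_norms` never increases, which mirrors the monotonic-improvement property stated in the abstract.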

Paper Structure

This paper contains 8 sections, 3 theorems, 12 equations, 9 figures, 1 algorithm.

Key Result

Proposition 8.1

Let ${\bm r}^{(t)}$ denote the residual at iteration $t$ of MP-SAE inference, and let $j^{(t)}$ be the index of the atom selected at step $t$. If column $j^{(t)}$ of the dictionary ${\bm D}$ satisfies $\|{\bm D}_{j^{(t)}}\|_2 = 1$, then the residual becomes orthogonal to the previously selected at
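
The statement above is truncated in this summary. As a hedged illustration of why stepwise orthogonality holds, assume the standard Matching Pursuit update, with ${\bm r}^{(t+1)}$ denoting the residual after the atom $j^{(t)}$ has been removed (this indexing convention is an assumption and may differ from the paper's):

$$
{\bm r}^{(t+1)} = {\bm r}^{(t)} - \left({\bm D}_{j^{(t)}}^\top {\bm r}^{(t)}\right) {\bm D}_{j^{(t)}}
$$

Projecting the new residual onto the selected atom and using $\|{\bm D}_{j^{(t)}}\|_2 = 1$ gives

$$
{\bm D}_{j^{(t)}}^\top {\bm r}^{(t+1)}
= {\bm D}_{j^{(t)}}^\top {\bm r}^{(t)} - \left({\bm D}_{j^{(t)}}^\top {\bm r}^{(t)}\right) \|{\bm D}_{j^{(t)}}\|_2^2
= 0 .
$$

Under this update, orthogonality is guaranteed only with respect to the most recently selected atom, not to all earlier ones, which is consistent with the "stepwise" qualifier in the proposition's title.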

Figures (9)

  • Figure 1: Matching Pursuit Sparse Autoencoders (MP-SAE)
  • Figure 2: Expressivity.
  • Figure 3: Feature Selection vs. Activation Levels. Top: atoms with highest activation frequency ($\ell_0$). Bottom: atoms with highest activation $\mathbb{E}[{\bm z}_j]$ ($\ell_1$).
  • Figure 4: Activation distributions.
  • Figure 5: Coherence analysis of learned concepts.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Proposition 8.1: Stepwise Orthogonality of MP Residuals
  • proof
  • Proposition 8.2: Monotonic Decrease of MP Residuals
  • proof
  • Proposition 8.3: Asymptotic Convergence of MP Residuals
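
As a hedged sketch of the argument behind monotonic decrease, assume the standard Matching Pursuit update ${\bm r}^{(t+1)} = {\bm r}^{(t)} - c\, {\bm D}_{j^{(t)}}$ with $c = {\bm D}_{j^{(t)}}^\top {\bm r}^{(t)}$ and $\|{\bm D}_{j^{(t)}}\|_2 = 1$ (the symbol $c$ and the indexing convention are introduced here for illustration). Expanding the squared norm gives

$$
\|{\bm r}^{(t+1)}\|_2^2
= \|{\bm r}^{(t)}\|_2^2 - 2c\, {\bm D}_{j^{(t)}}^\top {\bm r}^{(t)} + c^2 \|{\bm D}_{j^{(t)}}\|_2^2
= \|{\bm r}^{(t)}\|_2^2 - \left({\bm D}_{j^{(t)}}^\top {\bm r}^{(t)}\right)^2
\le \|{\bm r}^{(t)}\|_2^2 .
$$

The residual norm is therefore non-increasing, and it strictly decreases whenever the selected atom has nonzero correlation with the current residual.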