From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba

TL;DR

This work questions the sufficiency of the Linear Representation Hypothesis (LRH) for interpreting neural representations and introduces MP-SAE, a Matching Pursuit–unrolled sparse autoencoder that enforces conditional orthogonality through residual-guided, sequential encoding. By testing on synthetic hierarchical data and pretrained vision-language models, MP-SAE captures hierarchical, nonlinear, and multimodal features that standard SAEs miss, while offering adaptive inference-time sparsity and monotonic reconstruction improvement. The results suggest that interpretability methods should align with the phenomenology of representations, not just linear decomposability, and highlight practical benefits for modular, multilayered, and multimodal feature discovery. Overall, MP-SAE provides a principled tool for uncovering richer structure in neural representations and offers robustness advantages for variable sparsity without retraining.

Abstract

Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help us identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuit (MP) algorithm from sparse coding to design MP-SAE, an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating that the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue that our results lend credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
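To make the "sequence of residual-guided steps" concrete, here is a minimal NumPy sketch of classical matching pursuit used as an encoder. It is not the authors' implementation: the function name `mp_encode`, the random dictionary, the selection by largest absolute correlation, and the absence of any learned bias or training loop are all assumptions made for illustration.

```python
import numpy as np


def mp_encode(x, D, n_steps):
    """Greedy matching pursuit (illustrative sketch, not the paper's code).

    x: input vector of shape (d,)
    D: dictionary of shape (d, k) whose columns are assumed unit-norm
    n_steps: number of sequential selection steps
    """
    residual = x.copy()
    codes = np.zeros(D.shape[1])
    history = []  # (selected index, residual norm after the step)
    for _ in range(n_steps):
        # Correlate every atom with the current residual.
        scores = D.T @ residual
        # Assumption: pick the atom with the largest absolute correlation.
        j = int(np.argmax(np.abs(scores)))
        # Accumulate the coefficient and remove that atom's contribution.
        codes[j] += scores[j]
        residual = residual - scores[j] * D[:, j]
        history.append((j, float(np.linalg.norm(residual))))
    return codes, residual, history


# Hypothetical toy setup: random unit-norm dictionary and random input.
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 64))
D /= np.linalg.norm(D, axis=0)
x = rng.normal(size=16)

codes, residual, history = mp_encode(x, D, n_steps=8)
x_hat = D @ codes  # linear decoding of the sparse code
```

Because each step only depends on the current residual, the number of steps can be changed at inference time without touching the dictionary, which is the sense in which a sequential encoder allows adaptive sparsity.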

Paper Structure

This paper contains 47 sections, 4 theorems, 26 equations, 22 figures, 1 table, 1 algorithm.

Key Result

Proposition 3.1

Let ${\bm r}^{(t)}$ be the residual at iteration $t$ of MP-SAE inference, and let $\bm{D}_{j^{(t-1)}}$ be the feature selected at step $t{-}1$. Then:
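The displayed conclusion is not reproduced in this extract. Going by the proposition's title (Stepwise Orthogonality of MP Residuals), the claim is presumably that the residual is orthogonal to the most recently selected feature. A sketch of why this holds for classical matching pursuit, assuming unit-norm dictionary columns and the residual update written on the first line (both are assumptions, not quoted from the paper):

$$
\begin{aligned}
{\bm r}^{(t)} &= {\bm r}^{(t-1)} - \big\langle \bm{D}_{j^{(t-1)}}, {\bm r}^{(t-1)} \big\rangle\, \bm{D}_{j^{(t-1)}}, \qquad \|\bm{D}_{j^{(t-1)}}\|_2 = 1, \\
\big\langle \bm{D}_{j^{(t-1)}}, {\bm r}^{(t)} \big\rangle &= \big\langle \bm{D}_{j^{(t-1)}}, {\bm r}^{(t-1)} \big\rangle - \big\langle \bm{D}_{j^{(t-1)}}, {\bm r}^{(t-1)} \big\rangle\, \|\bm{D}_{j^{(t-1)}}\|_2^2 = 0.
\end{aligned}
$$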

Figures (22)

  • Figure 1: Conceptual organization in neural representations. A) Linearly accessible concepts: abstract directions that are approximately orthogonal and independently interpretable, as assumed by the Linear Representation Hypothesis (LRH). B) Hierarchical concepts: representations structured in parent–child relations. C) Nonlinear, multidimensional, and temporally structured concepts: features that cannot be accessed via a single direction.
  • Figure 2: Illustrative Example of Conditional vs. Quasi-Orthogonality. A) Example of a hierarchical concept tree. B) Comparison of quasi-orthogonality (interference within levels) vs. conditional orthogonality (orthogonality across levels). C) Correlation matrix of features sampled from A, showing conditional orthogonality (white, $=0$) across levels and quasi-orthogonality (light blue, $=\varepsilon$) within levels.
  • Figure 3: Matching Pursuit Sparse Autoencoders (MP-SAE)
  • Figure 4: Evaluating SAEs on a hierarchical tree with controlled within-level similarity. A) Correlation matrices for one similarity setting. The left panel shows the ground-truth matrix; the top row displays $\bm{D}^\top \bm{D}$ (self-similarity of learned features) and the bottom row shows $\bm{D}_\text{GT}^\top \bm{D}$ (alignment with ground truth). Bottom: quantitative evaluation across varying levels of within-group correlation; the median over 10 runs is reported. B) Flat MSE captures the deviation from the ground-truth intra-level correlation. C) Hierarchical MSE quantifies unintended correlations across levels.
  • Figure 5: MP-SAE recovers more expressive features than standard SAEs. Reconstruction performance ($R^2$) as a function of sparsity level across four pretrained vision models: SigLIP, DINOv2, CLIP, and ViT. MP-SAE consistently achieves higher $R^2$ at comparable sparsity, indicating more efficient and informative decompositions.
  • ...and 17 more figures

Theorems & Definitions (10)

  • Definition 2.1: Linear Representation Hypothesis (LRH)
  • Definition 2.2: Sparse Autoencoders
  • Definition 2.3: Conditional Orthogonality
  • Proposition 3.1: Stepwise Orthogonality of MP Residuals
  • Definition 4.1: Hierarchical Generative Process
  • Proposition C.1: Stepwise Orthogonality of MP Residuals
  • proof
  • Proposition C.2: Monotonic Decrease of MP Residuals
  • proof
  • Proposition C.3: Asymptotic Convergence of MP Residuals
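The statements of the appendix propositions are not reproduced in this extract. As a hedged illustration of the monotonic decrease named in Proposition C.2, classical matching pursuit with unit-norm atoms satisfies a one-line Pythagorean identity. The coefficient symbol $z^{(t)}$ and the update rule below are assumptions introduced here for illustration; the paper's own statement and proof may differ.

$$
\begin{aligned}
z^{(t)} &= \big\langle \bm{D}_{j^{(t)}}, {\bm r}^{(t)} \big\rangle, \qquad {\bm r}^{(t+1)} = {\bm r}^{(t)} - z^{(t)} \bm{D}_{j^{(t)}}, \qquad \|\bm{D}_{j^{(t)}}\|_2 = 1, \\
\|{\bm r}^{(t+1)}\|_2^2 &= \|{\bm r}^{(t)}\|_2^2 - 2\,\big(z^{(t)}\big)^2 + \big(z^{(t)}\big)^2 = \|{\bm r}^{(t)}\|_2^2 - \big(z^{(t)}\big)^2 \;\le\; \|{\bm r}^{(t)}\|_2^2 .
\end{aligned}
$$

Under these assumptions the residual norm never increases, and it strictly decreases whenever the selected atom has nonzero correlation with the current residual.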