Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle

TL;DR

The paper tackles the instability of Sparse Autoencoders for unsupervised concept extraction in large vision models by introducing Archetypal SAEs (A-SAE), which constrain dictionary atoms to lie in the convex hull of the data and thereby improve stability. A relaxed variant (RA-SAE) preserves reconstruction quality while maintaining stability; it relies on a distillation step that uses a subset C of the data and a mild relaxation term. The authors develop novel evaluation metrics and identifiability-inspired benchmarks to rigorously assess dictionary plausibility, identifiability, and structure, demonstrating that RA-SAE yields more structured, semantically meaningful concepts across diverse architectures. They also provide a scalable implementation and show broad applicability to state-of-the-art vision models, paving the way for more reliable concept discovery in large-scale models.

Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover "true" classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.

Paper Structure

This paper contains 39 sections, 4 theorems, 40 equations, 14 figures, 4 tables.

Key Result

Proposition 6.1

Given $\bm{A} \in \mathbb{R}^{n \times d}$ as a set of $n$ data points and $\bm{W} \in \Omega_{k,n}$ as any row-stochastic matrix, parameterizing $\bm{D} = \bm{W} \bm{A}$ ensures that each concept $\bm{D}_i$ lies within the convex hull of the data, i.e., $\bm{D}_i \in \mathrm{conv}(\bm{A})$ for all
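
The parameterization in Proposition 6.1 can be illustrated with a minimal NumPy sketch. It assumes the row-stochastic matrix is obtained by a row-wise softmax over unconstrained logits; that choice, the function names `row_softmax` and `archetypal_dictionary`, and the toy shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def row_softmax(logits):
    """Map unconstrained logits to a row-stochastic matrix (entries >= 0, rows sum to 1).

    Softmax is an assumed parameterization chosen for this sketch.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def archetypal_dictionary(logits, A):
    """Return W and D = W A; each row of D is a convex combination of the rows of A."""
    W = row_softmax(logits)
    return W, W @ A

# Toy example (hypothetical sizes): n data points in d dimensions, k concepts.
rng = np.random.default_rng(0)
n, d, k = 50, 8, 5
A = rng.normal(size=(n, d))
logits = rng.normal(size=(k, n))
W, D = archetypal_dictionary(logits, A)
```

Because every row of `W` is non-negative and sums to one, every row of `D` is by construction a convex combination of data points, so it cannot leave the convex hull of `A` regardless of how the logits are updated during training.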

Figures (14)

  • Figure 1: A) Archetypal-SAE. Archetypal-SAEs constrain dictionary atoms (decoder directions) to the data’s convex hull, improving stability. A relaxed variant (RA-SAE) allows mild relaxation, matching standard SAEs in reconstruction while maintaining stability. Both integrate with any SAE variant (e.g., TopK, JumpReLU). B) Instability Problem. Standard SAEs produce inconsistent dictionaries across runs, undermining interpretability. For example, in classical SAEs, the second most important concept for "rabbit" in one run has no counterpart in another run ($\cos = 0.58$). In contrast, Archetypal-SAEs maintain consistent concept correspondences across runs, ensuring stability.
  • Figure 2: SAEs are a promising direction for scalable concept extraction in vision. Comparison of reconstruction error ($\ell_2$ Loss) and sparsity across four large-scale vision models: ConvNext, DINO, SigLIP, and ViT. The figure compares the performance of various dictionary learning methods, including classical approaches (Convex-NMF, Semi-NMF) and modern Sparse Autoencoders (Vanilla SAE, Top-K SAE, JumpReLU SAE). Each SAE is trained up to 250 million tokens per epoch over 50 epochs, demonstrating the scalability of SAEs and their ability to achieve superior trade-offs between reconstruction fidelity and sparsity compared to traditional methods.
  • Figure 3: Stability-Reconstruction tradeoff (optimal: top-left). We implement 5 dictionary learning methods on 4 models at 5 levels of sparsity each, as well as our A-SAE method. We show that SAEs exhibit instability (minor perturbations in the dataset can lead to significant changes in the learned dictionary), while traditional dictionary learning methods are more stable but worse at reconstructing the data. Archetypal-SAEs (ours) help mitigate this issue. We measure stability based on Eq. \ref{eq:stability}: the optimal average cosine similarity between the dictionaries across 4 runs after finding the best alignment via the Hungarian algorithm. Archetypal-SAEs improve stability without compromising reconstruction fidelity, performing better on the stability-reconstruction tradeoff than existing methods.
  • Figure 4: Impact of the Relaxation Parameter ($\delta$). Enumerating extreme points is infeasible in practice; therefore, we introduce a small relaxation parameter $(\delta)$ that allows exploration beyond the convex hull of $\mathbf{C}$. The magnitude of this relaxation enables the Archetypal SAE to achieve performance comparable to the unconstrained TopK SAE denoted as Baseline (left) while maintaining excellent stability (right).
  • Figure 5: Pseudocode for Relaxed Archetypal SAE (RA-SAE). This implementation ensures that dictionary atoms remain close to the convex hull of the data, $\mathrm{conv}(\bm{C})$, while allowing controlled deviations for better flexibility.
  • ...and 9 more figures
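
The stability measure described in the Figure 3 caption (average cosine similarity after the best alignment found by the Hungarian algorithm) can be sketched for a single pair of dictionaries as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, atoms are assumed to be non-zero rows, and how the paper aggregates over its 4 runs is not specified in this summary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unit_rows(D):
    """Scale each atom (row) to unit L2 norm; assumes no all-zero rows."""
    return D / np.linalg.norm(D, axis=1, keepdims=True)

def stability(D1, D2):
    """Mean cosine similarity between atoms of D1 and D2 under the best one-to-one matching."""
    cos = unit_rows(D1) @ unit_rows(D2).T                    # (k, k) pairwise cosine similarities
    rows, cols = linear_sum_assignment(cos, maximize=True)   # Hungarian-style optimal assignment
    return float(cos[rows, cols].mean())

# Toy check: a permuted, rescaled copy of a dictionary is a perfect match.
rng = np.random.default_rng(0)
D_a = rng.normal(size=(6, 16))
D_b = 2.0 * D_a[rng.permutation(6)]
score_same = stability(D_a, D_b)
score_random = stability(D_a, rng.normal(size=(6, 16)))
```

The matching step matters because two runs may learn the same atoms in a different order; without it, a simple row-by-row comparison would report instability that is only a relabeling.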

Theorems & Definitions (8)

  • Proposition 6.1: Archetypal Dictionary, Convex and Conic Hulls
  • proof
  • Proposition 6.2: Geometric Stability of Archetypal Dictionaries
  • proof
  • Proposition 6.3: Rank Bound of Archetypal Dictionaries
  • proof
  • Proposition 6.4: OOD Measure with Non-Interfering Archetypes
  • proof