Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle
TL;DR
The paper tackles the instability of Sparse Autoencoders (SAEs) used for unsupervised concept extraction in large vision models. It introduces Archetypal SAEs (A-SAE), which constrain dictionary atoms to lie in the convex hull of the data and thereby improve stability. A relaxed variant (RA-SAE) preserves reconstruction quality while maintaining stability: a distillation step restricts the anchoring points to a subset C of the data, and a mild relaxation term lets atoms deviate slightly from the hull. The authors develop new evaluation metrics and identifiability-inspired benchmarks to assess dictionary plausibility, identifiability, and structure, and report that RA-SAE yields more structured, semantically meaningful concepts across diverse architectures. They also provide a scalable implementation and apply it to state-of-the-art vision models.
Abstract
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of the data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and the mildly relaxed variant, RA-SAE, further matches state-of-the-art reconstruction abilities. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover "true" classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
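
To make the convex-hull constraint concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes each dictionary atom is a convex combination of the rows of a data subset C, with weights produced by a row-wise softmax, and that the relaxation is an additive term whose per-atom L2 norm is capped at a small radius. The names `archetypal_dictionary`, `relaxed_dictionary`, `logits`, `Lam`, and `delta`, the softmax parametrization, and the per-row norm cap are all illustrative assumptions not stated in the text above.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax: each output row is non-negative and sums to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def archetypal_dictionary(logits, C):
    """Atoms as convex combinations of the rows of C (assumed A-SAE form).

    logits: (k, n) unconstrained parameters, one row per atom (hypothetical).
    C:      (n, d) subset of data points used as anchors.
    Returns a (k, d) dictionary whose rows lie in the convex hull of C.
    """
    W = softmax_rows(logits)  # (k, n), row-stochastic
    return W @ C


def relaxed_dictionary(logits, C, Lam, delta):
    """Convex-hull atoms plus a small bounded offset (assumed RA-SAE form).

    Lam:   (k, d) free offset per atom (hypothetical name).
    delta: maximum L2 norm allowed for each row of the offset.
    """
    norms = np.linalg.norm(Lam, axis=1, keepdims=True)
    # Shrink any row of Lam whose norm exceeds delta back onto the ball.
    scale = np.minimum(1.0, delta / np.maximum(norms, 1e-12))
    return archetypal_dictionary(logits, C) + Lam * scale


rng = np.random.default_rng(0)
C = rng.normal(size=(32, 8))       # 32 anchor points in 8 dimensions
logits = rng.normal(size=(5, 32))  # parameters for 5 atoms
Lam = rng.normal(size=(5, 8))      # unconstrained relaxation parameters

D = archetypal_dictionary(logits, C)
D_relaxed = relaxed_dictionary(logits, C, Lam, delta=0.1)
```

In this sketch, `D` cannot leave the convex hull of `C` whatever values `logits` takes, which is the geometric anchoring the abstract credits for the stability gain. `D_relaxed` stays within distance `delta` of `D`, trading a little of that anchoring for flexibility in reconstruction.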
