Interpretable and Testable Vision Features via Sparse Autoencoders

Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su

TL;DR

The paper proposes sparse autoencoders (SAEs) as a practical, model-agnostic bridge between concept discovery and causal probing in vision models. When an SAE is trained on frozen ViT activations, each sparse feature yields real-image exemplars and a decoding vector that enables precise, testable interventions without retraining the backbone. Applying this to CLIP and DINOv2 reveals that language supervision fosters cultural and abstract semantic features that are absent in the purely visual model, and SAE-based edits provide causal validation across classification and semantic segmentation tasks. The work advocates for a falsifiable, interactive interpretability framework and provides code and demos to encourage broader exploration, while acknowledging that its evidence is qualitative and that further methodological development is needed.

Abstract

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc tools supply both in a single, model-agnostic procedure. We use sparse autoencoders (SAEs) to bridge this gap; each sparse feature comes with real-image exemplars that reveal its meaning and a decoding vector that can be manipulated to probe its influence on downstream task behavior. By applying our method to widely used pre-trained vision models, we reveal meaningful differences in the semantic abstractions learned by different pre-training objectives. We then show that a single SAE trained on frozen ViT activations supports patch-level causal edits across tasks (classification and segmentation), all without retraining the ViT or task heads. These qualitative, falsifiable demonstrations position SAEs as a practical bridge between concept discovery and causal probing of vision models. We provide code, demos, and models on our project website: https://osu-nlp-group.github.io/saev.
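
The idea of a sparse feature with a manipulable decoding vector can be illustrated with a small, dependency-free sketch. It assumes a common ReLU SAE formulation (encoder `h = ReLU(W_enc (x - b_dec) + b_enc)`, decoder `x_hat = W_dec h + b_dec`), which may differ from the paper's exact equations. All names here (`encode`, `decode`, `intervene`, the toy weights) are hypothetical, and adding the reconstruction error back after the edit is one plausible design choice, not a confirmed detail of the authors' code.

```python
# Toy sparse autoencoder over a single patch activation, in pure Python.
# Assumption: a standard ReLU SAE; the weights are hand-picked, not trained.

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def encode(x, w_enc, b_enc, b_dec):
    """Sparse code h = ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(w_enc, centered)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def decode(h, w_dec, b_dec):
    """Reconstruction x_hat = W_dec h + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, h), b_dec)]

def intervene(x, feature, new_value, w_enc, b_enc, w_dec, b_dec):
    """Set one SAE feature to new_value and return the edited activation.

    The reconstruction error (x - x_hat) is added back so that only the
    chosen feature's contribution to the activation changes.
    """
    h = encode(x, w_enc, b_enc, b_dec)
    x_hat = decode(h, w_dec, b_dec)
    error = [xi - xh for xi, xh in zip(x, x_hat)]
    h_edit = list(h)
    h_edit[feature] = new_value
    x_edit = decode(h_edit, w_dec, b_dec)
    return [xe + e for xe, e in zip(x_edit, error)]

# A 2-dimensional activation and 3 SAE features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]  # each column is a decoding vector
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
h = encode(x, W_ENC, B_ENC, B_DEC)  # only feature 0 is active
x_suppressed = intervene(x, 0, 0.0, W_ENC, B_ENC, W_DEC, B_DEC)
```

In a real pipeline the edited activation would be written back into the frozen ViT at the chosen patch, and the unchanged task head would be re-run to see whether the prediction moves as the feature's interpretation predicts.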

Paper Structure

This paper contains 44 sections, 2 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Sparse autoencoders (SAEs) trained on pre-trained ViT activations discover a wide spread of features across both visual patterns and semantic structures. We show eight different features from an SAE trained on ImageNet-1K activations from a CLIP-trained ViT-B/16. Colored patches mark where the SAE features fire within an image; each SAE feature fires on semantically consistent but visually diverse patches.
  • Figure 1: SAE metrics on ImageNet-1K. Low reconstruction error (mean-squared error; MSE) and sparse activations demonstrate successful decomposition of ViT representations. "Dead" neurons are active on less than $10^{-7}\%$ of inputs; "Dense" neurons are active on more than $1\%$ of all inputs.
  • Figure 2: Given a picture and a set of highlighted patches, we find exemplar images by (1) getting ViT activations for each patch, (2) computing a sparse representation for each highlighted patch (\ref{eq:enc}, \ref{eq:act}), (3) summing over sparse representations, (4) choosing the top $k$ features by activation magnitude, and (5) finding existing images that maximize these features.
  • Figure 3: CLIP learns robust cultural visual features. Top Left (a): A "Brazil" feature (CLIP-24K/6909) responds to distinctive Brazilian imagery including Rio de Janeiro's urban landscape, the national flag, and the iconic sidewalk tile pattern of Copacabana Beach. Top Right (b): CLIP-24K/6909 does not respond to other South American symbols like Machu Picchu or the Argentinian flag. Bottom Left (c): We search DINOv2's SAE for a similar "Brazil" feature and find that DINOv2-24K/9823 fires on Brazilian imagery. Bottom Right (d): However, the maximally activating ImageNet-1K examples for DINOv2-24K/9823 are of lamps, convincing us that DINOv2-24K/9823 does not reliably detect Brazilian cultural symbols.
  • Figure 4: MSE and L0 on the training dataset for all learning rates and sparsity coefficients $\lambda$.
  • ...and 11 more figures
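
Steps (3) to (5) of the exemplar search described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, assuming the sparse codes for the highlighted patches (steps 1 and 2) are already available and that each dataset image has a precomputed per-feature activation score. The function names, the `image_scores` layout, and the toy numbers are hypothetical and are not taken from the authors' code.

```python
# Sketch of the exemplar search in the Figure 2 caption, in pure Python.
# Assumptions: sparse codes for the highlighted patches are given, and
# image_scores[name][f] is the activation of feature f on a dataset image.

def sum_codes(codes):
    """Step 3: sum the sparse representations over the highlighted patches."""
    return [sum(values) for values in zip(*codes)]

def top_k_features(summed, k):
    """Step 4: indices of the k features with the largest summed activation."""
    order = sorted(range(len(summed)), key=lambda i: summed[i], reverse=True)
    return order[:k]

def exemplars(feature, image_scores, n):
    """Step 5: the n dataset images on which `feature` is most active."""
    ranked = sorted(
        image_scores,
        key=lambda name: image_scores[name][feature],
        reverse=True,
    )
    return ranked[:n]

# Two highlighted patches, four SAE features (toy values).
highlighted_codes = [
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 2.0, 0.5, 0.0],
]
# Three dataset images with per-feature activation scores (toy values).
image_scores = {
    "img_a": [0.1, 4.0, 0.0, 0.0],
    "img_b": [0.0, 0.5, 0.0, 2.0],
    "img_c": [0.0, 2.5, 1.0, 0.2],
}

summed = sum_codes(highlighted_codes)
features = top_k_features(summed, k=2)
results = {f: exemplars(f, image_scores, n=2) for f in features}
```

Sorting every image for every query is fine at toy scale; at ImageNet scale one would more likely precompute and cache the top-scoring images per feature once.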