Steering CLIP's vision transformer with sparse autoencoders

Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, Blake Aaron Richards

TL;DR

This work introduces sparse autoencoders trained on CLIP's vision transformer to reveal interpretable, steerable features and quantify their control over CLIP's outputs. By defining and validating steerability metrics, the authors show that a meaningful subset of SAE features can precisely steer predictions and substantially increase the accessible concept space compared to neuron-level control. They demonstrate practical benefits by suppressing spurious correlations and defending against typographic attacks, attaining state-of-the-art performance on typographic defenses and improved disentanglement on CelebA and Waterbirds. The results highlight fundamental differences between vision and language processing in CLIP and provide a scalable toolkit for mechanistic interpretability and robust vision-language systems.

Abstract

While vision models are highly capable, their internal mechanisms remain poorly understood, a challenge that sparse autoencoders (SAEs) have helped address in language but that remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis of the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers and achieving state-of-the-art performance on defense against typographic attacks.
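
To make "targeted suppression of SAE features" concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a vanilla ReLU SAE, and the names `W_enc`, `b_enc`, `W_dec`, `b_dec`, `suppress_feature`, and `scale` are hypothetical. Adding the SAE's reconstruction error back after the edit is also an assumption, made so that only the chosen feature's contribution changes.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Vanilla ReLU SAE encoder (assumed architecture): non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f, W_dec, b_dec):
    # Linear decoder back to the model's activation space.
    return f @ W_dec + b_dec

def suppress_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx, scale=0.0):
    """Rescale one SAE feature and return the edited activation.

    scale=0.0 removes the feature; scale=1.0 leaves x unchanged.
    The reconstruction error is added back (an assumption of this sketch).
    """
    f = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(f, W_dec, b_dec)
    f_steered = f.copy()
    f_steered[..., feature_idx] *= scale
    return sae_decode(f_steered, W_dec, b_dec) + error

# Toy data standing in for real activations (hypothetical sizes: 8-dim stream, 32 features).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = rng.normal(size=d_model)
x = rng.normal(size=(5, d_model))  # five tokens

x_suppressed = suppress_feature(x, W_enc, b_enc, W_dec, b_dec, feature_idx=3)
```

In a real model the edited activation would be written back into the forward pass at the layer the SAE was trained on. That wiring is omitted here because it depends on the framework.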

Paper Structure

This paper contains 43 sections, 12 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Our method improves performance on vision disentanglement tasks by detecting and suppressing features with CLIP SAEs. For CelebA, we suppress blondeness to improve gender classification. For details, see Section \ref{ssec:supress_sc}.
  • Figure 2: The top activating images show that base net features are polysemantic, while SAE features capture task-relevant attributes: blondeness (CelebA), land/water backgrounds (Waterbirds), and typographic images (typographic attacks). Feature selection details are in Section \ref{ssec:supress_sc}, and more examples are in Appendix \ref{app:more_max_images}, Figure \ref{fig:FULL_SAE_neuron_max_act_images}.
  • Figure 3: Typographic Attack on CLIP: On the left, an ImageNet-100 sample. On the right, the same image with "tiger" written on it. As demonstrated by \citet{goh2021multimodal}, this simple text overlay can mislead CLIP's zero-shot classification towards the attacker's intended label.
  • Figure 4: A visualization of the L0 values for an x64 vanilla SAE trained on all patches of CLIP-B-32 for Layer 0. a) A heatmap of average L0 per patch, overlaid on the original image grid, shows that there is a bias toward the center. The center bias remains constant for all layers (see Appendix \ref{app:l0_comparison_details}). b) A box plot of L0s per patch reflects high-norm spatial tokens and a low-norm CLS token. c) A comparison between the L0s of SAEs trained on the residual stream of GPT-2 and CLIP.
  • Figure 5: Asymptotic Feature Steerability Plot showing $\Delta P_f$ (dotted) and $\mathcal{S}_f$ (solid) versus steering strength. The "dragon" feature achieves perfect steering to a single concept, while "tree" and "apache" have similar $\mathcal{S}_f$ but different $\Delta P_f$: "tree" steers precisely to tree concepts, whereas "apache" disperses across helicopter-related concepts (e.g., "aircraft", "aviation", "rescue").
  • ...and 5 more figures