Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

TL;DR

This work extends sparse autoencoders to vision-language models to quantify neuron-level monosemanticity with a new MonoSemanticity score (MS). It demonstrates that SAEs yield more monosemantic neurons, particularly with wider and sparser latent representations, and validates MS against human judgments via a large user study. The authors also show that SAE-based interventions on the CLIP vision encoder can steer multimodal LLM outputs without modifying the language model, enabling targeted insertion or suppression of discovered concepts. The work provides a practical, unsupervised toolkit for improving interpretability and controllability in VLMs and offers benchmark data for future research.

Abstract

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron level in visual representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Further, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers multimodal LLM outputs (e.g., LLaVA), without any modifications to the underlying language model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs. Code and benchmark data are available at https://github.com/ExplainableML/sae-for-vlm.

Paper Structure

This paper contains 25 sections, 17 equations, 19 figures, 14 tables.

Figures (19)

  • Figure 1: Sparse Autoencoder (SAE) in VLM (e.g. CLIP): Top activating images of a neuron in a pretrained VLM layer are polysemantic (left), and those of a neuron in a sparse latent of SAE trained to reconstruct the same layer are monosemantic (right), according to MonoSemanticity score (MS).
  • Figure 2: Computation of our MonoSemanticity score (MS). (a) Embeddings and activations are extracted for a set of images (b) to compute the pairwise embedding similarities and pairwise neuron activations. (c) MS is the average of embedding similarities weighted by the neuron activations.
  • Figure 3: Top activating images of neurons with MonoSemanticity (MS) scores ranging from high (left) to low (right). Higher scores correlate with more similar images, reflecting monosemanticity.
  • Figure 4: Alignment Rate (AR, %) of humans with MS score when judging which neuron in a pair is more monosemantic, grouped by MS difference between the neurons. Bars show AR per interval; dots show cumulative AR up to that interval.
  • Figure 5: MonoSemanticity scores in decreasing order across neurons, normalized by width. Results are shown for the last layer of the model without SAE ("No SAE", black dashed line) and with SAE (solid lines), using either (a) different expansion factors ($\varepsilon=1$, $\varepsilon=2$, $\varepsilon=4$, $\varepsilon=16$, $\varepsilon=64$) or (b) different sparsity levels ($K=1$, $K=10$, $K=20$, and $K=50$).
  • ...and 14 more figures
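
The Figure 2 caption describes MS as an average of pairwise embedding similarities weighted by pairwise neuron activations. The sketch below is one possible reading of that description, not the authors' implementation. It assumes cosine similarity between image embeddings, min-max normalized activations, and pair weights formed as the product of the two normalized activations. The function name `monosemanticity_score` and its arguments are hypothetical.

```python
import numpy as np


def monosemanticity_score(embeddings, activations, eps=1e-12):
    """Activation-weighted mean of pairwise embedding similarities for one neuron.

    embeddings:  (N, D) array, one embedding per image (any image encoder; assumption).
    activations: (N,) array, the neuron's activation on each of the same images.
    Returns NaN when every pair weight is zero (e.g. constant activations).
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    act = np.asarray(activations, dtype=np.float64)

    # Pairwise cosine similarity between image embeddings.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.maximum(norms, eps)
    sim = emb @ emb.T

    # Min-max normalize activations to [0, 1] (assumed normalization).
    span = act.max() - act.min()
    act = (act - act.min()) / max(span, eps)

    # Pair weight = product of the two normalized activations (assumed form).
    weights = np.outer(act, act)

    # Use each unordered pair once (strict upper triangle, i < j).
    iu = np.triu_indices(len(act), k=1)
    w, s = weights[iu], sim[iu]

    total = w.sum()
    if total <= eps:
        return float("nan")
    return float((w * s).sum() / total)


if __name__ == "__main__":
    # Toy data: images 0 and 1 look alike, image 2 is different.
    emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    # Neuron firing only on the two similar images: high score.
    print(monosemanticity_score(emb, [1.0, 1.0, 0.0]))
    # Neuron firing most on two dissimilar images: low score.
    print(monosemanticity_score(emb, [1.0, 0.5, 1.0]))
```

Under these assumptions, a neuron whose strongly activating images are mutually similar scores near 1, while a neuron that fires on unrelated images scores near 0, which matches the ordering illustrated in Figure 3.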