Language Models Can Explain Visual Features via Steering

Javier Ferrando; Enrique Lopez-Cuena; Pablo Agustin Martin-Torres; Daniel Hinjos; Anna Arias-Duart; Dario Garcia-Gasulla

Abstract

Sparse Autoencoders (SAEs) uncover thousands of features in vision models, yet explaining these features without human intervention remains an open challenge. While previous work generates correlation-based explanations from top-activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models (VLMs) and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it "sees", effectively eliciting the visual concept represented by each feature. Results show that Steering offers a scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.
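
As a rough illustration of the steering step described in the abstract, the sketch below adds a scaled feature direction to every token activation of a blank input. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `steer`, the strength `alpha`, the unit-normalisation of the direction, and the toy activations are all hypothetical, and the abstract does not specify the exact form of the intervention.

```python
import math

def steer(acts, direction, alpha):
    """Add alpha times the unit-norm feature direction to each token activation.

    acts: list of per-token activation vectors (hypothetical stand-in for
          vision-encoder activations on an empty image).
    direction: vector associated with one SAE feature (assumed here to be
               its decoder direction).
    alpha: steering strength (assumed hyperparameter).
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    return [[a + alpha * u for a, u in zip(token, unit)] for token in acts]

# Toy data: two tokens of a 3-dimensional "blank image" activation.
blank_acts = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
feature_direction = [3.0, 0.0, 4.0]

steered = steer(blank_acts, feature_direction, alpha=10.0)
```

In a full pipeline, the steered activations would be passed on to the language model, which is then prompted to describe what it "sees"; that part is omitted here because it depends on the specific model.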

Paper Structure (44 sections, 10 equations, 14 figures, 6 tables)

Introduction
Extracting Features
Automatically Interpreting Features
Top-k Explanations
Proposed Approach
Steering-based Explanations
Steering-informed Top-k Explanations
Evaluating the Quality of the Explanations
Evaluation Metrics
Simulation-based Evaluation
CLIP-based Evaluation
Synthetic-image-based Evaluation
Experimental Setup
Results
Explaining through Steering
...and 29 more sections

Figures (14)

Figure 1: Top: A vision feature extracted with an SAE is explained based on top-activating images, which are passed to the VLM to generate an explanation based on correlated visual evidence. Bottom: We propose to automatically obtain explanations of SAE features by causally intervening (steering) a vision encoder. The intervention is done after feeding it an information-devoid white image, effectively making the language model articulate what visual concept that feature represents.
Figure 2: Middle layer SAE synthetic-image-based evaluation scores of Top-k method as a function of the similarity with Steering Explanations.
Figure 2: Count and percentage of 'background' explanations turned 'animal' explanations by different methods (see main text for details).
Figure 3: Gemma 3 synthetic-image-based evaluation scores of Steering method as a function of the size of the LM $\text{m}_{\text{subj}}$.
Figure 4: Example of Top-k explanation exhibiting contextual bias.
...and 9 more figures
