Decoding Vision Transformers: the Diffusion Steering Lens

Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai

TL;DR

This work tackles mechanistic interpretability for Vision Transformers (ViTs), where Logit Lens struggles to capture rich visual representations. Building on Diffusion Lens, the authors introduce Diffusion Steering Lens (DSL), a training-free technique that steers submodule outputs and patches the subsequent indirect contributions, isolating the direct contribution of components such as attention heads. In interventional studies, DSL identifies the heads that directly influence final predictions more reliably than Diffusion Lens does. The approach offers a practical tool for interpretable analysis of ViTs and bears on how vision transformers iteratively refine their representations; steering artifacts and decoder-induced limitations are left as areas for future work.

Abstract

Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting them into the output vocabulary space. Although applying Logit Lens to Vision Transformers (ViTs) is technically straightforward, its direct use faces limitations in capturing the richness of visual representations. Building on the work of Toker et al. (2024), who introduced Diffusion Lens to visualize intermediate representations in the text encoders of text-to-image diffusion models, we demonstrate that while Diffusion Lens can effectively visualize residual stream representations in image encoders, it fails to capture the direct contributions of individual submodules. To overcome this limitation, we propose Diffusion Steering Lens (DSL), a novel, training-free approach that steers submodule outputs and patches subsequent indirect contributions. We validate our method through interventional studies, showing that DSL provides an intuitive and reliable interpretation of the internal processing in ViTs.
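
The two ideas named in the abstract, projecting an intermediate residual state into output space (Logit Lens) and steering one submodule's output while patching every later submodule, can be illustrated on a toy residual network. The sketch below is not the authors' implementation: the model, the names `submodule`, `forward`, and `logit_lens`, the steering strength `alpha`, and the choice to patch later layers with their clean-run outputs are all assumptions made for illustration. No diffusion decoder is involved, and the final normalization that a real Logit Lens typically applies is omitted.

```python
import math
import random

random.seed(0)
D_MODEL, N_LAYERS, N_CLASSES = 6, 4, 3


def rand_matrix(rows, cols, scale):
    """Hypothetical random parameters; a real ViT would supply trained weights."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


WEIGHTS = [rand_matrix(D_MODEL, D_MODEL, 0.5) for _ in range(N_LAYERS)]
UNEMBED = rand_matrix(D_MODEL, N_CLASSES, 1.0)


def matvec(vec, mat):
    """Row vector times matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def submodule(layer, resid):
    """Toy stand-in for an attention head or MLP that reads the residual stream."""
    return [math.tanh(v) for v in matvec(resid, WEIGHTS[layer])]


def forward(x, steer_layer=None, alpha=0.0, cached=None):
    """Run the toy network; return (residual state after each layer, submodule outputs).

    steer_layer / alpha: add an extra alpha * (that layer's output) to the stream.
    cached: clean-run outputs. When given, layers after steer_layer reuse them
            (patching), so they cannot react to the steered stream.
    """
    resid = list(x)
    resids, outputs = [], []
    for layer in range(N_LAYERS):
        patched = cached is not None and steer_layer is not None and layer > steer_layer
        out = cached[layer] if patched else submodule(layer, resid)
        scale = 1.0 + alpha if layer == steer_layer else 1.0
        resid = [r + scale * o for r, o in zip(resid, out)]
        resids.append(resid)
        outputs.append(out)
    return resids, outputs


def logit_lens(resid):
    """Project a residual state directly into class-logit space."""
    return matvec(resid, UNEMBED)


x = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
clean_resids, clean_outputs = forward(x)

# Logit-Lens-style readout: the top class after each layer of the clean run.
per_layer_prediction = [
    max(range(N_CLASSES), key=lambda c, logits=logit_lens(r): logits[c])
    for r in clean_resids
]

# Steer layer 1, once with later layers patched and once without.
alpha = 2.0
patched_resids, _ = forward(x, steer_layer=1, alpha=alpha, cached=clean_outputs)
unpatched_resids, _ = forward(x, steer_layer=1, alpha=alpha)

direct_shift = [p - c for p, c in zip(patched_resids[-1], clean_resids[-1])]
total_shift = [u - c for u, c in zip(unpatched_resids[-1], clean_resids[-1])]
print("per-layer top class:", per_layer_prediction)
```

With patching, the final residual state moves by exactly `alpha` times the steered submodule's output, so whatever a decoder reads from it reflects that submodule's direct contribution alone. Without patching, later layers recompute on the steered stream and their indirect effects are mixed into the shift.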

Paper Structure

This paper contains 18 sections, 2 equations, 17 figures.

Figures (17)

  • Figure 1: Schematic of Diffusion Steering Lens. Visualizing the direct contribution of $h_m$ at layer $n$.
  • Figure 2: Diffusion Lens on Resid post: an example (cup). Around layer 30, the visualization starts resembling the input image.
  • Figure 3: Top 6 heads in similarity with input when using Diffusion Lens. Although a few heads in later layers reflect some contribution (e.g., L47H14: "snake"), most submodule outputs do not match the residual stream visualization.
  • Figure 4: Top 6 heads in similarity with input when using Diffusion Steering Lens. The DSL outputs more faithfully reflect the input, particularly for earlier heads (e.g., L36H4), highlighting its ability to visualize direct submodule contributions.
  • Figure 5: Images with overlays (flower)
  • ...and 12 more figures