VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models

Hyesu Lim, Jinho Choi, Taekyung Kim, Byeongho Heo, Jaegul Choo, Dongyoon Han

TL;DR

This work introduces VisualScratchpad, an interactive interface for visual concept analysis during inference that applies sparse autoencoders to the vision encoder and links the resulting visual concepts to text tokens via text-to-image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model.

Abstract

High-performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. We apply sparse autoencoders to the vision encoder and link the resulting visual concepts to text tokens via text-to-image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model. VisualScratchpad also provides a token-latent heatmap view that suggests a sufficient set of latents for effective concept ablation in causal analysis. Through case studies, we reveal three underexplored failure modes: limited cross-modal alignment, misleading visual concepts, and unused hidden cues. Project page: https://hyesulim.github.io/visual_scratchpad_projectpage/

Paper Structure (20 sections, 10 figures)

Figures (10)

  • Figure 1: VisualScratchpad pipeline. A. During inference in a vision-language model, we extract the intermediate representation z from the vision encoder. B. A sparse autoencoder processes z to produce concept activations. The attention map from output text tokens to image tokens is applied at the patch level to weight these activations. Latents exhibiting similar activation patterns across output tokens are then clustered and visualized in a token–latent heatmap. C. The causal influence of these concepts on the model’s output can be evaluated through latent ablation.
  • Figure 2: Attention-based concept re-ranking. A. SAEs return latent activations for each image patch. B. Image-level activations can be computed by naïvely averaging activations across all patches, or C. by applying a weighted average where the text-to-image attention map serves as the weighting coefficient, promoting concepts relevant to the text tokens to the top of the ranking. The bottom row shows the top-ranked concept obtained from the corresponding method.
  • Figure 3: Case studies. A. Limited cross-modal alignment, B. misleading visual cues, and C. unused hidden cues can each cause a wrong answer.
  • Figure 4: Complexity of visual concepts across layers. Early layers capture simple visual components, including position and color, mid layers capture object-level shape and patterns, and late layers understand scene-level events.
  • Figure 5: Token-latent heatmap visualization. Raw values are difficult to analyze, so we normalize column-wise to show whether a latent is attended to by specific tokens or by tokens overall. We also cluster and sort the columns by activation correlation, using hierarchical clustering.
  • ...and 5 more figures
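
The re-ranking described in the Figure 2 caption and the column-wise normalization mentioned in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy inputs, the choice of min-max scaling for the normalization, and the layout of the heatmap (rows as output tokens, columns as latents) are all assumptions made for illustration.

```python
# Illustrative sketch only; names, data, and the min-max choice are assumptions.

def mean_pool(acts):
    """Naive image-level score: average each latent over all patches.

    acts: list of per-patch activation lists, shape [n_patches][n_latents].
    """
    n_patches = len(acts)
    n_latents = len(acts[0])
    return [sum(row[k] for row in acts) / n_patches for k in range(n_latents)]


def attention_pool(acts, attn):
    """Attention-weighted average of each latent over patches.

    attn: one non-negative text-to-image attention weight per patch.
    Weights are renormalized so that they sum to 1.
    """
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    n_latents = len(acts[0])
    return [
        sum(w * row[k] for w, row in zip(attn, acts)) / total
        for k in range(n_latents)
    ]


def rank_latents(scores):
    """Return latent indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


def normalize_columns(heatmap):
    """Min-max scale each column (assumed: one column per latent) to [0, 1].

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    n_cols = len(heatmap[0])
    lows = [min(row[k] for row in heatmap) for k in range(n_cols)]
    highs = [max(row[k] for row in heatmap) for k in range(n_cols)]
    return [
        [
            (row[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
            for k in range(n_cols)
        ]
        for row in heatmap
    ]


# Toy input: 3 patches, 2 latents. Latent 0 fires on two weakly attended
# patches; latent 1 fires on the single patch the text token attends to.
acts = [[4.0, 0.0], [4.0, 0.0], [0.0, 6.0]]
attn = [0.1, 0.1, 0.8]

print(rank_latents(mean_pool(acts)))
print(rank_latents(attention_pool(acts, attn)))
```

In this toy input, plain averaging ranks latent 0 first because it is active on more patches, while attention weighting promotes latent 1 because its only active patch carries most of the attention mass. That is the effect the caption describes as promoting concepts relevant to the text tokens.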