Where do Large Vision-Language Models Look at when Answering Questions?

Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu

TL;DR

This work tackles the interpretability of LVLMs by extending heatmap-based visualization to open-ended, autoregressive outputs. It selects visually relevant tokens using token-level log-likelihood ratios and adapts iGOS++-style heatmaps to LVLM architectures with multiple encoders and multi-resolution vision streams, aided by a single-mask optimization based on graduated non-convexity (GNC). Experiments across state-of-the-art LVLMs and vision-centric benchmarks indicate that the vision architecture strongly shapes where a model looks, that scaling the LLM alone has limited impact on its focus regions, and that a correct answer does not always come with correct visual grounding. The study offers practical insights for evaluating and improving LVLM visual understanding beyond standard accuracy metrics, and the code and data are publicly available.
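The token selection idea mentioned above can be illustrated with a minimal sketch. It assumes that per-token probabilities of the generated answer are already available under two conditions, the original image and a blurred baseline image, and keeps tokens whose log-likelihood ratio exceeds a threshold. The function name `select_visual_tokens`, the threshold value, and the probabilities are hypothetical illustrations, not the authors' implementation or data.

```python
import math

def select_visual_tokens(tokens, p_image, p_baseline, threshold=1.0, eps=1e-12):
    """Keep tokens whose log-likelihood ratio (image vs. baseline) exceeds threshold.

    tokens      -- generated answer tokens
    p_image     -- probability of each token given the original image
    p_baseline  -- probability of each token given the baseline (e.g. blurred) image
    threshold   -- hypothetical cutoff; the paper's actual criterion may differ
    """
    selected = []
    for tok, p_img, p_base in zip(tokens, p_image, p_baseline):
        # eps guards against log(0) for tokens with vanishing probability
        llr = math.log(p_img + eps) - math.log(p_base + eps)
        if llr > threshold:
            selected.append((tok, llr))
    return selected

# Made-up probabilities for illustration only
tokens = ["There", "are", "three", "dogs"]
p_image = [0.90, 0.95, 0.80, 0.85]
p_baseline = [0.88, 0.94, 0.05, 0.10]

selected = select_visual_tokens(tokens, p_image, p_baseline)
for tok, llr in selected:
    print(f"{tok}: log-likelihood ratio = {llr:.2f}")
```

In this toy input, function words barely change when the image is blurred, while content words that depend on the image show a large ratio, which matches the observation that most tokens in a response are not strongly dependent on visual information.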

Abstract

Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? Interpreting the free-form generation of LVLMs is non-trivial because of their complex visual architectures (e.g., multiple encoders and multi-resolution inputs) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answer and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
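Mask-based heatmap methods in the iGOS++ family score image regions by blending the original image with a baseline (such as a blurred copy) according to a mask and observing how the model's output changes. The sketch below shows only that blending step on tiny nested lists. The convention that a mask value of 1 marks a salient pixel, and the helper names `insertion_image` and `deletion_image`, are assumptions made for illustration; the mask optimization itself is not shown.

```python
def insertion_image(image, baseline, mask):
    """Reveal salient pixels: keep the image where mask is 1, the baseline elsewhere."""
    return [
        [m * p + (1 - m) * b for p, b, m in zip(row_i, row_b, row_m)]
        for row_i, row_b, row_m in zip(image, baseline, mask)
    ]

def deletion_image(image, baseline, mask):
    """Remove salient pixels: use the baseline where mask is 1, the image elsewhere."""
    return [
        [(1 - m) * p + m * b for p, b, m in zip(row_i, row_b, row_m)]
        for row_i, row_b, row_m in zip(image, baseline, mask)
    ]

# Toy 2x2 single-channel "image"; an all-zero baseline stands in for a blurred copy
image = [[1.0, 2.0], [3.0, 4.0]]
baseline = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.0, 0.5]]  # assumed convention: 1 = salient

inserted = insertion_image(image, baseline, mask)
deleted = deletion_image(image, baseline, mask)
print(inserted)
print(deleted)
```

A good heatmap under this view is one where the deletion image sharply lowers the likelihood of the selected answer tokens and the insertion image restores it.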

Paper Structure

This paper contains 18 sections, 8 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Focus regions of LLaVA-1.5 when answering counting questions. The model may correctly focus on the relevant region and produce the correct answer (top left), or it may fail despite attending to the right region due to misinterpretation (top right). In some cases, incorrect focus leads to wrong answers (bottom right), while occasionally, the model answers correctly despite attending to irrelevant regions (bottom left), highlighting challenges in visual grounding and generalization.
  • Figure 2: Top: answers generated by LLaVA-1.5 given the original image and a fully blurred baseline image. Bottom: conditional probability of each token of the original answer given the input image and given the baseline image. Most tokens in the response depend only weakly on the visual information.
  • Figure 3: Qualitative comparison of different explanation methods. The tokens in red denote the selected crucial tokens. Our method consistently generates meaningful heatmaps.
  • Figure 4: Comparison of the generated responses and focus regions of different LVLMs. Tokens in red are the selected visually relevant tokens.
  • Figure 5: Answer correctness and focus region plausibility across four quadrants. Each color stands for a different question category, including spatial, attribute, counting, global context, and reasoning questions.
  • ...and 7 more figures