What's in the Image? A Deep-Dive into the Vision of Vision Language Models

Omri Kaduri, Shai Bagon, Tali Dekel

TL;DR

A thorough empirical analysis of the attention modules across layers reveals that VLMs generate surprisingly descriptive responses solely from the query tokens, without direct access to image tokens.

Abstract

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image") is utilized by VLMs to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose a novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.

Paper Structure

This paper contains 21 sections, 5 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: Fraction of attention to different token types: We measure the relative amount by which the generated tokens attend to: image tokens (blue), query text tokens (orange), and the previously generated tokens in the sequence (green). We report the distribution of relative attention for a set of 80 images, averaged across attention heads and generated tokens, for InternVL2-76B \cite{internvl1_5}; see Fig. \ref{fig:llava_attention_across_layers} for results on LLaVA-1.5.
  • Figure 1: Evaluation on MME: The results cover 10 Perception tasks of the MME benchmark \cite{fu2023mme}, illustrated in Fig. \ref{fig:mme_dataset}. Metrics include accuracy (ACC), ACC+ (percentage of images where all questions are correct), and the number of tokens used for re-prompting. The first table reports results over the first six subsets (Existence, Count, Position, Color, OCR, Poster), while the second table covers the remaining four subsets (Celebrity, Artwork, Scene, Landmark), along with the average across all subsets and the number of tokens used for re-prompting an image (i.e., asking more questions after "describe the image"). Results indicate that the K=5% compressed context suffers only a slight decrease in performance with respect to Naive, while using at least 12x fewer tokens.
  • Figure 2: Analyzing visual information flow via attention knockout: (a) The VLM employs causal masking (Eq. \ref{eq:causal_mask}), allowing generated and query tokens to gather information from image tokens, but not vice versa. We analyze three knockout configurations: (b) Image-to-generated $\text{KO}_{\text{img}\rightarrow\text{gen}}$: visual information flows to generated tokens only through query tokens, (c) Image-to-query $\text{KO}_{\text{img}\rightarrow\text{txt}}$: blocks query tokens from accessing image information, and (d) Image-to-others $\text{KO}_{\text{img}\rightarrow\text{txt+gen}}$: blocks image tokens from affecting all other tokens. (e) Evaluation of model responses (see Sec. \ref{sec:llmjudge}) under each knockout configuration reveals that $\text{KO}_{\text{img}\rightarrow\text{gen}}$ achieves a 0.4 F1 score despite indirect image access, while $\text{KO}_{\text{img}\rightarrow\text{txt}}$ fails completely, demonstrating query tokens' essential role as global image descriptors. (f) We expand previous experiments by knocking out attention, starting from layer $l$. Results highlight a consistent drastic rise in F1 scores in the mid-layers, suggesting their critical role in visual information processing. See LLaVA-1.5 results in Fig. \ref{fig:knockout-sm}.
  • Figure 3: LLM-as-a-judge example. We compare the original VLM's response and a modified one. The LLM identifies the objects in each description and matches the two object lists; it then counts the TP (objects found in both descriptions), FN (omitted objects), and FP (hallucinated objects), and the F1 score is computed.
  • Figure 4: Visual attention across layers: The input images (a) are fed to the VLM with the query text "describe the image". (b) Visualization of the attention between the generated tokens and each of the image tokens; attention is averaged over generated tokens and attention heads. Early and late layers exhibit outliers, while mid-layer attention maps are more spread out.
  • ...and 18 more figures
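
The caption on the fraction of attention to different token types describes splitting a generated token's attention among image, query, and previously generated tokens. The following is a minimal sketch of that bookkeeping for a single attention row, not the authors' code. The function name `attention_fractions`, the span layout (image tokens first, then query tokens, then generated tokens), and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def attention_fractions(
    attn_row: List[float], spans: Dict[str, Tuple[int, int]]
) -> Dict[str, float]:
    """Share of one token's attention mass that lands on each token type.

    attn_row: attention weights from one generated token to all positions
              it can see (hypothetical toy values below).
    spans:    token type -> half-open (start, end) index range.
    """
    total = sum(attn_row)
    return {name: sum(attn_row[s:e]) / total for name, (s, e) in spans.items()}


# Hypothetical layout: 4 image tokens, 2 query tokens, 1 earlier generated token.
row = [0.05, 0.05, 0.10, 0.10, 0.20, 0.20, 0.30]
spans = {"image": (0, 4), "query": (4, 6), "generated": (6, 7)}
fractions = attention_fractions(row, spans)
```

To approximate the reported statistic, these per-row fractions would then be averaged over attention heads and generated tokens, separately for each layer.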
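
The three knockout configurations in the Figure 2 caption can be pictured as edits to a boolean attention mask. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it assumes the sequence is laid out as image tokens, then query tokens, then generated tokens, and the helper names (`causal_mask`, `knockout`, `mask_for_layer`) and the token counts are hypothetical.

```python
from typing import Iterable, List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """allowed[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def knockout(mask: Mask, src: Iterable[int], dst: Iterable[int]) -> Mask:
    """Copy of mask in which every dst position is blocked from attending to src."""
    src = list(src)
    out = [row[:] for row in mask]
    for i in dst:
        for j in src:
            out[i][j] = False
    return out


def mask_for_layer(layer: int, start_layer: int, base: Mask, ko: Mask) -> Mask:
    """Apply the knockout only from start_layer onward, as in panel (f)."""
    return ko if layer >= start_layer else base


# Hypothetical layout: 4 image tokens, 3 query tokens, 2 generated tokens.
n_img, n_txt, n_gen = 4, 3, 2
img = range(0, n_img)
txt = range(n_img, n_img + n_txt)
gen = range(n_img + n_txt, n_img + n_txt + n_gen)

base = causal_mask(n_img + n_txt + n_gen)
ko_img_gen = knockout(base, img, gen)                    # panel (b)
ko_img_txt = knockout(base, img, txt)                    # panel (c)
ko_img_all = knockout(base, img, list(txt) + list(gen))  # panel (d)
```

In a real model the mask would typically be applied by setting blocked attention logits to negative infinity before the softmax; that step is omitted here.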
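
The scoring step in the Figure 3 caption (count TP, FN, and FP over matched objects, then compute F1) reduces to a short calculation. In the sketch below, exact string matching stands in for the LLM judge's object matching, which is an assumption made purely for illustration; the function names and the object lists are hypothetical, and returning 0.0 when both lists are empty is an arbitrary convention.

```python
from typing import Iterable, Tuple


def count_matches(reference: Iterable[str], modified: Iterable[str]) -> Tuple[int, int, int]:
    """Return (TP, FN, FP) using exact string matching as a stand-in for the LLM judge."""
    ref, mod = set(reference), set(modified)
    tp = len(ref & mod)  # objects found in both descriptions
    fn = len(ref - mod)  # objects omitted from the modified response
    fp = len(mod - ref)  # objects hallucinated in the modified response
    return tp, fn, fp


def f1_score(tp: int, fn: int, fp: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); 0.0 when there is nothing to match."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical object lists extracted from two responses.
original_objects = ["dog", "ball", "tree", "bench"]
modified_objects = ["dog", "ball", "cat"]

tp, fn, fp = count_matches(original_objects, modified_objects)
f1 = f1_score(tp, fn, fp)
```

With these toy lists, two objects match, two are omitted, and one is hallucinated, giving F1 = 4/7.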