From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

Xiaofeng Zhang, Yihao Quan, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye

TL;DR

This paper proposes integrating attention analysis with LLaVA-CAM. Concretely, attention scores highlight relevant regions during forward propagation, while LLaVA-CAM captures gradient changes through backward propagation, revealing key image features.

Abstract

Large Vision Language Models (LVLMs) achieve strong performance on vision-language reasoning tasks; however, their black-box nature hinders in-depth research on the reasoning mechanism. Because every image must be converted into image tokens to fit the input format of large language models (LLMs) alongside natural language prompts, the sequential visual representation is essential to LVLM performance, and information flow analysis can be an effective tool for determining interactions between these representations. In this paper, we propose integrating attention analysis with LLaVA-CAM: attention scores highlight relevant regions during forward propagation, while LLaVA-CAM captures gradient changes through backward propagation, revealing key image features. By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers. To validate our analysis, we conduct comprehensive experiments with truncation strategies across various LVLMs on visual question answering and image captioning tasks. The results not only verify our hypothesis but also reveal a consistent pattern of information flow convergence in the corresponding layers, and show that the information flow cliff layer varies with context. The paper's source code is available at https://github.com/zhangbaijin/From-Redundancy-to-Relevance
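
To make the backward-propagation half of the abstract concrete, the following is a minimal sketch of a standard Grad-CAM-style score computed over image tokens. It is not the paper's LLaVA-CAM implementation, whose exact formulation is not given here. The function names `grad_cam_token_scores` and `to_grid` are hypothetical, and the inputs are assumed to be the hidden activations of the image tokens at one layer together with the gradients of the answer logit with respect to those activations.

```python
from typing import List


def grad_cam_token_scores(activations: List[List[float]],
                          gradients: List[List[float]]) -> List[float]:
    """Grad-CAM-style relevance for each image token at one layer.

    activations[t][k]: feature k of image token t (assumed input).
    gradients[t][k]:   d(answer logit) / d(activations[t][k]) (assumed input,
                       obtained from a backward pass).
    """
    n_tokens = len(activations)
    n_channels = len(activations[0])
    # Channel weight: the gradient averaged over all token positions.
    weights = [sum(gradients[t][k] for t in range(n_tokens)) / n_tokens
               for k in range(n_channels)]
    # Weighted sum of activations per token, then ReLU to keep positive evidence.
    return [max(0.0, sum(weights[k] * activations[t][k] for k in range(n_channels)))
            for t in range(n_tokens)]


def to_grid(scores: List[float], width: int) -> List[List[float]]:
    """Reshape a flat list of token scores into rows of `width` for display."""
    return [scores[i:i + width] for i in range(0, len(scores), width)]


# Toy example: 4 image tokens with 2 features each, laid out as a 2x2 grid.
acts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
grads = [[0.5, -0.5] for _ in range(4)]
heatmap = to_grid(grad_cam_token_scores(acts, grads), width=2)
print(heatmap)  # [[0.5, 0.0], [0.0, 0.0]]
```

With the 576 image tokens used in the truncation experiments, the same reshape with `width=24` would give a 24x24 map, assuming the tokens are stored in row-major patch order.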

Paper Structure (17 sections, 13 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Information flow between tokens; from left to right: system tokens, image tokens, user tokens, and output tokens. In the shallow layers, the information flow of the system, image, and user tokens converges toward the output tokens. In the deep layers, the convergence of the system and user tokens is much more pronounced than that of the image tokens; we call these deep layers information flow cliff layers.
  • Figure 2: (A) The share of the answer's attention weight assigned to system tokens, image tokens, and prompt tokens. (B) The attention map over system tokens, image tokens, prompt tokens, and answer tokens. The attention scores for image tokens decrease rapidly in layers 1-5 and stabilize in layers 6-31. Throughout these layers, image tokens receive significantly less attention than system and user tokens. At the 32nd layer, however, the attention scores of image and user tokens increase rapidly.
  • Figure 3: LLaVA-CAM results of the LLM on the ScienceQA dataset (complex reasoning). The information flow of the image converges to the correct region in the early layers and diverges in the deeper layers, after which the information flow cliff layer begins to appear.
  • Figure 4: LLaVA-CAM results on POPE and TextVQA (common reasoning). The LLaVA-CAM maps for VQA show that when the model recognizes a clearly identifiable object such as "horse" or "people", it focuses on the corresponding regions from the first layer through the deeper layers, until a cliff layer occurs and the information flow becomes sparse.
  • Figure 5: Experiments truncating the 576 image tokens on three VQA datasets (POPE, TextVQA, ScienceQA) and one captioning dataset (CHAIR); the red arrow marks the information flow cliff layer. LLaVA1.5-7B, Intern-VL 7B, and Qwen-VL 7B all follow the pattern of information flow convergence in the early layers and dispersion in the deep layers. Deeper layers can exhibit cliff layers, where truncating image tokens no longer affects the model's accuracy.
  • ...and 4 more figures
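
The two measurements described in the captions for Figure 2 (attention share per token group) and Figure 5 (truncating image tokens at a given layer) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `attention_share` and `run_with_truncation` are hypothetical, token groups are assumed to occupy contiguous index ranges, and truncation is assumed to mean dropping the image-token positions from the sequence before a chosen layer runs.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of sequence positions


def attention_share(attn_row: List[float],
                    groups: Dict[str, Span]) -> Dict[str, float]:
    """Fraction of one answer token's attention mass that falls on each group."""
    mass = {name: sum(attn_row[start:end]) for name, (start, end) in groups.items()}
    total = sum(mass.values())
    return {name: value / total for name, value in mass.items()}


def run_with_truncation(tokens: list,
                        layers: List[Callable[[list], list]],
                        image_span: Span,
                        cut_layer: int) -> list:
    """Apply `layers` in order, dropping the image tokens before `cut_layer`."""
    start, end = image_span
    for index, layer in enumerate(layers):
        if index == cut_layer:
            # Assumed truncation strategy: remove the image-token positions.
            tokens = tokens[:start] + tokens[end:]
        tokens = layer(tokens)
    return tokens


# Toy attention row of one answer token over [system, image, image, user].
groups = {"system": (0, 1), "image": (1, 3), "user": (3, 4)}
print(attention_share([0.5, 0.125, 0.125, 0.25], groups))
# {'system': 0.5, 'image': 0.25, 'user': 0.25}

# Toy "model": three identity layers, image tokens removed before layer 1.
identity = lambda seq: list(seq)
print(run_with_truncation(["sys", "img0", "img1", "user"],
                          [identity] * 3, image_span=(1, 3), cut_layer=1))
# ['sys', 'user']
```

In a real LVLM, sweeping `cut_layer` over all layers and recording task accuracy at each setting would be one way to locate the layer beyond which truncation stops mattering, which is how the Figure 5 caption characterizes a cliff layer.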