Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation

Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, Jiawei Yao

TL;DR

This work tackles the interpretability gap in large vision-language models (LVLMs) and the computational overhead of processing image tokens by introducing Simignore, a similarity-driven image token reduction method. By observing information flow in the LLM decoder, the authors find that image tokens semantically aligned with the text are more influential. They operationalize this through a pipeline that aligns image and text embeddings, selects the top-K text-relevant image tokens, and masks the rest. In experiments on ScienceQA with backbones such as LLaVA1.5 and Mipha, Simignore improves complex reasoning accuracy and reduces inference time, with cosine similarity performing best among the tested metrics. The results indicate that filtering image tokens by text relevance enhances reasoning while reducing computation, and the authors support the findings with attention visualizations, clustering, and ablations.

Abstract

Multimodal large language models have experienced rapid growth, and numerous different models have emerged. The interpretability of large vision-language models (LVLMs) remains an under-explored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to the text are more likely to show information flow convergence in the LLM decoding layers, and these image tokens receive higher attention scores. In contrast, image tokens that are less relevant to the text show no information flow convergence and receive only very small attention scores. To utilize the image information efficiently, we propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method for complex reasoning tasks. The paper's source code can be accessed from https://github.com/FanshuoZeng/Simignore.

Paper Structure

This paper contains 18 sections, 16 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The simplified structure of our method.
  • Figure 2: We find that the information flow converges in image regions related to the options in the prompt, such as mushroom and copepod.
  • Figure 3: The overall framework of our approach to enhancing complex reasoning in multimodal large language models through similarity computation between image and text embeddings. We map the embeddings of the image tokens and prompt tokens into the same similarity metric space, a process that involves operations such as regularization, and compute their similarity values there. We then select the $K$ image tokens with the highest similarity and consider them important. Unselected tokens are ignored by setting their attention mask to $0$.
  • Figure 4: Influence rate of attention score about image tokens.
  • Figure 5: Distribution of image tokens and prompt tokens in the cosine metric space.
  • ...and 2 more figures
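
The selection step described in the Figure 3 caption (compute image-text similarity, keep the $K$ most similar image tokens, set the attention mask of the rest to $0$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `simignore_mask` are hypothetical, embeddings are plain Python lists, and scoring each image token by its maximum similarity to any prompt token is an assumption made here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def simignore_mask(image_embeds, text_embeds, k):
    """Return a 0/1 attention mask over image tokens (hypothetical helper).

    Keeps the k image tokens most similar to the prompt and masks the rest.
    """
    if not text_embeds:
        raise ValueError("text_embeds must contain at least one token")
    # Assumption: an image token's relevance is its best match among prompt tokens.
    scores = [
        max(cosine_similarity(img, txt) for txt in text_embeds)
        for img in image_embeds
    ]
    k = max(0, min(k, len(scores)))
    # Indices of the k highest-scoring image tokens (ties keep the earlier token).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    # 1 = token stays visible to attention, 0 = token is ignored.
    return [1 if i in keep else 0 for i in range(len(scores))]


if __name__ == "__main__":
    image_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    text_embeds = [[1.0, 0.0]]
    print(simignore_mask(image_embeds, text_embeds, k=2))
```

In a real model the resulting mask would be combined with the decoder's existing attention mask over the image-token positions; how and at which layers that happens is not specified in this summary, so it is left out of the sketch.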