Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, Jiawei Yao
TL;DR
This work addresses two problems in large vision-language models (LVLMs): limited interpretability and the computational overhead of processing image tokens. It introduces Simignore, a similarity-driven image token reduction method. By observing information flow in the LLM decoder, the authors find that image tokens semantically aligned with the text are more influential. Simignore operationalizes this observation with a pipeline that aligns image and text embeddings, selects the top-K text-relevant image tokens, and masks the rest. In experiments on ScienceQA with backbones such as LLaVA1.5 and Mipha, Simignore improves complex reasoning accuracy and reduces inference time, with cosine similarity performing best among the tested metrics. Attention visualizations, clustering, and ablations support these findings.
Abstract
Multimodal large language models have grown rapidly, and many different models have emerged. However, the interpretability of large vision-language models (LVLMs) remains under-explored. Especially on more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we observe that in models such as LLaVA1.5, image tokens that are semantically related to the text are more likely to show information flow convergence in the LLM decoding layer, and these image tokens receive higher attention scores. In contrast, image tokens that are less relevant to the text show no such convergence and receive only very small attention scores. To use image information efficiently, we propose Simignore, a new image token reduction method that aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method on complex reasoning tasks. The paper's source code is available at https://github.com/FanshuoZeng/Simignore.
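The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that image and text embeddings already live in the same space, that cosine similarity is the metric, and that each image token is scored by its highest similarity to any text token. That aggregation rule, the function name `select_image_tokens`, and the parameter `k` are hypothetical choices made for illustration.

```python
import numpy as np


def select_image_tokens(image_emb, text_emb, k):
    """Return (keep_mask, scores) for image tokens ranked by text relevance.

    image_emb: array of shape (n_img, d), one row per image token.
    text_emb:  array of shape (n_txt, d), one row per text token.
    k:         number of image tokens to keep.

    Assumption (not from the paper): an image token's score is its maximum
    cosine similarity over all text tokens.
    """
    eps = 1e-12  # guards against division by zero for all-zero rows
    img = image_emb / np.maximum(
        np.linalg.norm(image_emb, axis=1, keepdims=True), eps
    )
    txt = text_emb / np.maximum(
        np.linalg.norm(text_emb, axis=1, keepdims=True), eps
    )

    # Cosine similarity between every image token and every text token.
    sim = img @ txt.T  # shape (n_img, n_txt)

    # One relevance score per image token (hypothetical aggregation: max).
    scores = sim.max(axis=1)

    # Keep the k highest-scoring image tokens; the rest are to be ignored.
    k = max(0, min(k, scores.shape[0]))
    keep_idx = np.argsort(-scores, kind="stable")[:k]
    keep_mask = np.zeros(scores.shape[0], dtype=bool)
    keep_mask[keep_idx] = True
    return keep_mask, scores


# Toy example: 4 image tokens and 2 text tokens in a 3-dimensional space.
image_emb = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
)
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
keep_mask, scores = select_image_tokens(image_emb, text_emb, k=2)
```

In a real model, a mask like `keep_mask` would plausibly be folded into the decoder's attention mask so that the dropped image tokens are ignored. How and at which layer that happens is not specified in this section.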
