Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly

Yexin Liu, Zhengyang Liang, Yueze Wang, Xianfeng Wu, Feilong Tang, Muyang He, Jian Li, Zheng Liu, Harry Yang, Sernam Lim, Bo Zhao

TL;DR

The paper investigates why multimodal LLMs can understand visual content yet give incorrect answers. It introduces MMVU, a benchmark with paired positive/negative questions across 12 categories to disentangle understanding from misdirection, and defines MR and RA to quantify robustness. It then builds a 112k-sample MMVU-Train via an information-extraction pipeline and proposes two prompting strategies, CGR and VAR, to align decoding with visual content. Across 15 MLLMs, MMVU reveals strong vulnerability to misleading prompts, and training with MMVU-Train plus CGR/VAR substantially improves robustness and accuracy on both MMVU and general benchmarks. The work provides concrete data-generation and prompting techniques to reduce hallucination in visually grounded reasoning, with implications for safer, more reliable multimodal AI systems.

Abstract

Multimodal Large Language Models (MLLMs) have displayed remarkable performance in multi-modal tasks, particularly in visual comprehension. However, we reveal that MLLMs often generate incorrect answers even when they understand the visual content. To this end, we manually construct a benchmark with 12 categories and design evaluation metrics that assess the degree of error in MLLM responses even when the visual content is seemingly understood. Based on this benchmark, we test 15 leading MLLMs and analyze the distribution of attention maps and logits of some MLLMs. Our investigation identifies two primary issues: 1) most instruction tuning datasets predominantly feature questions that 'directly' relate to the visual content, leading to a bias in MLLMs' responses to other indirect questions, and 2) MLLMs' attention to visual tokens is notably lower than to system and question tokens. We further observe that attention scores between questions and visual tokens as well as the model's confidence in the answers are lower in response to misleading questions than to straightforward ones. To address the first challenge, we introduce a paired positive and negative data construction pipeline to diversify the dataset. For the second challenge, we propose to enhance the model's focus on visual content during decoding by refining the text and visual prompt. For the text prompt, we propose a content guided refinement strategy that performs preliminary visual content analysis to generate structured information before answering the question. Additionally, we employ a visual attention refinement strategy that highlights question-relevant visual tokens to increase the model's attention to visual content that aligns with the question. Extensive experiments demonstrate that these challenges can be significantly mitigated with our proposed dataset and techniques.
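The paired positive/negative design lends itself to pair-level scoring. The excerpt above does not give formulas for the benchmark's metrics, so the following is only a minimal sketch under stated assumptions: `PairResult`, `paired_metrics`, `misled_rate`, and `pair_accuracy` are hypothetical names, and the two ratios are one plausible way to separate "understood the image" from "was misled by the question", not necessarily the paper's MR and RA.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Outcome for one image: a positive question and its paired negative question."""
    pos_correct: bool  # model answered the straightforward question correctly
    neg_correct: bool  # model answered the misleading question correctly


def paired_metrics(results: List[PairResult]) -> Dict[str, float]:
    """Compute two illustrative pair-level scores (assumed definitions).

    misled_rate:   among pairs whose positive question was answered correctly
                   (the image was apparently understood), the fraction whose
                   negative question was answered incorrectly.
    pair_accuracy: fraction of all pairs with both questions answered correctly.
    """
    total = len(results)
    understood = [r for r in results if r.pos_correct]
    both_correct = sum(1 for r in understood if r.neg_correct)
    misled = len(understood) - both_correct
    return {
        "misled_rate": misled / len(understood) if understood else 0.0,
        "pair_accuracy": both_correct / total if total else 0.0,
    }


# Toy data, invented purely for illustration.
example = [
    PairResult(pos_correct=True, neg_correct=True),
    PairResult(pos_correct=True, neg_correct=False),
    PairResult(pos_correct=False, neg_correct=False),
    PairResult(pos_correct=True, neg_correct=False),
]
scores = paired_metrics(example)
```

Conditioning the first score on a correct positive answer is what isolates the failure mode the paper studies: errors on images the model demonstrably understood, rather than errors from failing to perceive the image at all.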

Paper Structure

This paper contains 51 sections, 3 equations, 24 figures, 17 tables, 1 algorithm.

Figures (24)

  • Figure 1: (a) Examples illustrating that MLLMs can accurately understand the visual content yet provide incorrect responses. Each example shows a pair of positive and negative questions. The model answers the positive question correctly (shown in green), demonstrating that it understands the image, but fails to answer the negative question correctly (shown in red). This paper investigates this phenomenon. (b) Comparison of the model's accuracy on positive and negative questions. (c) Performance comparison of MLLMs after fine-tuning with our dataset. The metric used here is the Response Accuracy (RA) defined in the Evaluation metrics section.
  • Figure 2: The MMVU dataset consists of a benchmarking dataset for evaluating models as well as a training dataset. The former is curated by human annotators together with the appropriate metrics and analysis on the MLLM's attention and logit behavior. Based on the experiments, we propose a data construction pipeline to build a training dataset and prompting strategies to enhance the accuracy of MLLM responses.
  • Figure 3: Examples in the MMVU benchmark. POS and NEG denote the positive and negative questions, ANS denotes the answer.
  • Figure 4: See the analysis subsection for details on the following calculations. (a) Statistics of attention scores between the answer tokens and the system, visual, and question tokens. In general, the answer tokens pay the least attention to the visual tokens. (b) The ratio of the question tokens' attention to the system and visual tokens in negative samples versus positive samples. Negative questions appear to pay less attention to visual tokens than positive questions do. (c) Procedure for calculating the ratio of output probabilities for negative and positive samples. (d) Comparison of the ratio of output probabilities for negative and positive samples across different MLLMs. Interestingly, a lower output probability seems to correlate with lower attention between the question and visual tokens in (b).
  • Figure 5: Visualization of the fine-tuning loss for different data compositions (version 0).
  • ...and 19 more figures
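
The attention comparison described in the Figure 4 caption can be illustrated with a small sketch. The paper's exact calculation is not reproduced in this excerpt, so everything here is an assumption: `attention_share` and `neg_pos_ratio` are hypothetical helpers, the token spans and weights are toy values, and the share is simply the attention mass a query token places on each token group, normalized over the groups.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) index range in the input sequence


def attention_share(attn_row: List[float], spans: Dict[str, Span]) -> Dict[str, float]:
    """Fraction of one query token's attention mass that falls on each token group."""
    mass = {name: sum(attn_row[start:end]) for name, (start, end) in spans.items()}
    total = sum(mass.values())
    return {name: (m / total if total else 0.0) for name, m in mass.items()}


def neg_pos_ratio(neg: Dict[str, float], pos: Dict[str, float], group: str) -> float:
    """Ratio of a group's attention share under a negative vs. a positive question."""
    return neg[group] / pos[group]


# Toy layout: 2 system tokens, 2 visual tokens, 2 question tokens (assumed).
spans = {"system": (0, 2), "visual": (2, 4), "question": (4, 6)}

# Toy attention rows for the same image under a positive and a negative question.
row_pos = [0.3, 0.2, 0.1, 0.1, 0.1, 0.2]
row_neg = [0.3, 0.3, 0.05, 0.05, 0.1, 0.2]

share_pos = attention_share(row_pos, spans)
share_neg = attention_share(row_neg, spans)
visual_ratio = neg_pos_ratio(share_neg, share_pos, "visual")
```

A ratio below 1 for the visual group would correspond to the pattern reported in panel (b), where negative questions appear to attend less to visual tokens than positive ones. In practice the rows would come from a model's attention maps, typically averaged over heads and layers; that aggregation choice is left open here.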