Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs

Sihang Zhao, Youliang Yuan, Xiaoying Tang, Pinjia He

TL;DR

A preliminary exploration of how to mitigate laziness finds that chain of thought (CoT) can effectively address this issue.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate a strong understanding of the real world and can even handle complex tasks. However, they still fail on some straightforward visual question-answering (VQA) problems. This paper examines this issue more closely, revealing that models tend to err when answering easy questions (e.g., Yes/No questions) about an image, even though they can correctly describe it. We refer to this discrepancy in model behavior between difficult and simple questions as model laziness. To systematically investigate model laziness, we manually construct LazyBench, a benchmark that includes Yes/No questions, multiple-choice questions, short-answer questions, and image description tasks, all related to the same subjects in the images. Based on LazyBench, we observe that laziness is widespread among current advanced MLLMs (e.g., GPT-4o, Gemini-1.5-pro, Claude 3, and LLaVA-v1.5-13B) and is more pronounced in stronger models. We also analyze the performance of LLaVA-v1.5-13B on the VQA v2 benchmark and find that about half of its failure cases are caused by model laziness, which further highlights the importance of ensuring that the model fully utilizes its capability. To this end, we conduct a preliminary exploration of how to mitigate laziness and find that chain of thought (CoT) can effectively address this issue.
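To make the "about half of its failure cases" statement concrete, the following is a minimal sketch of how a laziness rate could be computed from per-question results. The definition used here (a failure counts as lazy when the simple question is answered incorrectly but the model's description of the same subject is correct) is an assumption inferred from the abstract, not the paper's confirmed metric. The names `Record` and `lazy_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One hypothetical evaluation record for a single image subject."""
    simple_correct: bool       # simple question (e.g., Yes/No) answered correctly?
    description_correct: bool  # free-form description stated the subject correctly?


def lazy_rate(records):
    """Share of simple-question failures whose description was nonetheless correct.

    Assumed definition for illustration only; returns 0.0 when there are no failures.
    """
    failures = [r for r in records if not r.simple_correct]
    if not failures:
        return 0.0
    lazy = sum(r.description_correct for r in failures)
    return lazy / len(failures)


# Toy data: three failures on the simple question, two of them "lazy".
records = [
    Record(simple_correct=False, description_correct=True),   # lazy case
    Record(simple_correct=False, description_correct=False),  # genuine failure
    Record(simple_correct=True, description_correct=True),    # no failure
    Record(simple_correct=False, description_correct=True),   # lazy case
]
print(lazy_rate(records))
```

Restricting the denominator to failures, rather than to all questions, mirrors the abstract's phrasing about the share of failure cases attributable to laziness.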

Paper Structure

This paper contains 24 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: MLLMs sometimes fail to correctly answer straightforward Yes/No or multiple-choice questions based on images. However, they often manage to avoid these errors when describing the images. We refer to this phenomenon as "model laziness."
  • Figure 2: The green box represents a correct, brief statement about the "question subject" in the image. The blue box contains four different types of questions about this subject (Yes/No, multiple-choice, short-answer questions, and descriptive requests). They are used to evaluate the model's laziness, and the construction of these questions is described in the paper's section on benchmark construction.
  • Figure 3: The process of constructing LazyBench: we utilize CLIP (Radford et al., 2021) to identify images that the model considers "similar" and analyze the differences between them to pinpoint instances where MLLMs provide incorrect answers. Based on these errors, we construct a series of related questions.
  • Figure 4: Examples of LLaVA-1.5-13B being lazy in VQA-v2. The first line of boxes below each image contains the original labels and questions in VQA-v2, as well as the initial responses from LLaVA-1.5-13B. The second line of boxes contains the statement and description request automatically generated by Doby. The last line contains the responses of LLaVA-1.5-13B to Doby's questions. Subsequently, by comparing these responses to the statement, it is determined whether the model is being lazy in these cases.
  • Figure 5: The statement is a brief statement about the "question subject" in the image. The conversed statement contradicts the "question subject". The irrelevant question is a Yes/No question unrelated to the image content, and the conversed Yes/No question is derived from the correct statement. They are used to ensure that the model does not thoughtlessly respond with "yes."
  • ...and 6 more figures