Grounded Chain-of-Thought for Multimodal Large Language Models

Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, Rongrong Ji

TL;DR

The paper tackles visual hallucination in multimodal large language models by introducing Grounded Chain-of-Thought (GCoT), a framework that grounds stepwise reasoning to visual evidence and coordinates. A new MM-GCoT dataset with ~24k examples and three task types supports training and evaluation of GCoT, using metrics for answer accuracy, grounding accuracy, and answer-grounding consistency. Empirical results across 12 MLLMs show current models struggle with grounding consistency, and that GCoT training improves visual-spatial reasoning and reduces inconsistencies, with transfer to open-world QA and grounding tasks. Overall, GCoT offers a promising path to more trustworthy multimodal reasoning and a rich dataset for future exploration including RL and embodied applications.

Abstract

Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, which greatly impedes their trustworthy application. In this paper, we study this problem from the perspective of visual-spatial reasoning and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Unlike recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images. In addition, we introduce a comprehensive consistency evaluation system comprising three metrics: answer accuracy, grounding accuracy, and answer-grounding consistency. We further conduct a series of experiments on 12 advanced MLLMs and reveal some notable findings: (i) most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; (ii) visual hallucination is not directly related to parameter size or general multimodal performance, i.e., a larger and stronger MLLM is not less affected by this issue. Lastly, we demonstrate that the proposed dataset can help existing MLLMs cultivate their GCoT capability and significantly reduce inconsistent answering. Moreover, their GCoT capability can also be generalized to existing multimodal tasks, such as open-world QA and referring expression comprehension (REC).
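
The three metrics named in the abstract (answer accuracy, grounding accuracy, answer-grounding consistency) can be illustrated with a minimal Python sketch. The abstract does not give their formulas, so everything below is an assumption made for illustration: answers are compared by case-insensitive exact match, a box counts as correct when its IoU with the ground truth is at least 0.5, and consistency is taken as the fraction of examples where the answer and the grounding are either both correct or both wrong. The names `Prediction`, `iou`, and `evaluate` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed convention


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Prediction:
    answer: str      # model's textual answer
    box: Box         # model's predicted bounding box
    gt_answer: str   # reference answer
    gt_box: Box      # reference bounding box


def evaluate(preds: List[Prediction], iou_threshold: float = 0.5) -> Dict[str, float]:
    """Compute the three scores under the assumptions stated in the lead-in."""
    if not preds:
        raise ValueError("preds must not be empty")
    n = len(preds)
    # Assumed answer check: case-insensitive exact match.
    ans_ok = [p.answer.strip().lower() == p.gt_answer.strip().lower() for p in preds]
    # Assumed grounding check: IoU at or above the threshold.
    box_ok = [iou(p.box, p.gt_box) >= iou_threshold for p in preds]
    # Assumed consistency: answer and grounding agree (both right or both wrong).
    agree = [a == b for a, b in zip(ans_ok, box_ok)]
    return {
        "answer_acc": sum(ans_ok) / n,
        "grounding_acc": sum(box_ok) / n,
        "consistency": sum(agree) / n,
    }


if __name__ == "__main__":
    gt = (10.0, 10.0, 50.0, 50.0)
    far = (200.0, 200.0, 240.0, 240.0)
    preds = [
        Prediction("red", gt, "red", gt),     # right answer, right box
        Prediction("red", far, "red", gt),    # right answer, wrong box
        Prediction("blue", far, "red", gt),   # wrong answer, wrong box
        Prediction("Red", gt, "red", gt),     # right answer, right box
    ]
    print(evaluate(preds))
```

Under this reading, the second prediction is the hallucination case the abstract is concerned with: the answer is correct but the model points at the wrong region, so it lowers both grounding accuracy and consistency while leaving answer accuracy untouched.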

Paper Structure

This paper contains 9 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison between common question answering (a) and the proposed grounded chain-of-thought (GCoT) (b). (a) An MLLM can directly output the correct answer, which, however, is often not based on what it sees, as shown in its attention heat map. This issue undermines its trustworthy application. (b) GCoT aims to help the MLLM reason in a grounded, step-by-step manner and output the answer with coordinates as the intuitive basis.
  • Figure 2: Comparison between relevant vision-language tasks (a-d) and our GCoT (e). Phrase grounding (a) is similar to our GCoT in terms of grounding outputs, but it only detects the instances mentioned in the caption. Grounded QA (c) also lacks the hidden reasoning steps of GCoT. VCoT (d) extends the common VQA of MLLMs (b) by providing more detailed answering thoughts, which is more about knowledge reasoning. In contrast, our GCoT aims to decompose the question into multiple task steps with grounded information, providing an intuitive basis for the visual-spatial reasoning of MLLMs.
  • Figure 3: Examples of the proposed multimodal grounded chain-of-thought (MM-GCoT) dataset. MM-GCoT has three splits of examples, namely Attribute (a), Judgement (b) and Object (c). Each example consists of multiple reasoning steps with grounded information, i.e., spatial coordinates, which serve to cultivate the GCoT capability of MLLMs. MM-GCoT also reserves a set of examples for the hallucination evaluation of MLLMs in terms of answer accuracy, grounding accuracy and answer-grounding consistency.
  • Figure 4: Illustration of two prompting settings for evaluating LLaVA-1.5 on MM-GCoT. (a) Answer-First: The MLLM generates a textual response, then produces the corresponding bounding box in subsequent conversational turns. (b) Grounding-First: The MLLM first provides a visual bounding box, then responds with a text answer.
  • Figure 5: Visualization of the results from Qwen2.5-VL and InternVL2.5 models of different parameter scales under answer-first and grounding-first prompting, respectively. The blue, red and green bounding boxes represent predictions from base-scale models (Qwen2.5-VL-7B and InternVL2.5-38B), predictions from large-scale models (Qwen2.5-VL-72B and InternVL2.5-78B), and the ground truth, respectively. These visualizations show that the larger MLLMs exhibit poorer consistency than their smaller counterparts.
  • ...and 2 more figures
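
The Figure 3 caption describes each MM-GCoT example as a sequence of reasoning steps carrying spatial coordinates, grouped into Attribute, Judgement and Object splits. One way to picture such a record is the Python sketch below. The actual file format of the dataset is not shown in this text, so the class names, field names, box convention and sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) convention

# Split names taken from the Figure 3 caption.
SPLITS = ("Attribute", "Judgement", "Object")


@dataclass
class ReasoningStep:
    text: str                  # what the model reasons about at this step
    box: Optional[Box] = None  # grounding coordinates for the step, if any


@dataclass
class GCoTExample:
    image_id: str              # hypothetical identifier
    question: str
    split: str                 # one of SPLITS
    steps: List[ReasoningStep]
    answer: str

    def __post_init__(self) -> None:
        if self.split not in SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")

    def grounded_boxes(self) -> List[Box]:
        """Boxes of all steps that carry grounding information, in order."""
        return [s.box for s in self.steps if s.box is not None]


if __name__ == "__main__":
    # Made-up values, not drawn from the dataset.
    example = GCoTExample(
        image_id="img_0001",
        question="What color is the cup on the table?",
        split="Attribute",
        steps=[
            ReasoningStep("Locate the table.", (20.0, 150.0, 400.0, 300.0)),
            ReasoningStep("Locate the cup on the table.", (180.0, 120.0, 220.0, 170.0)),
            ReasoningStep("Read the cup's color."),
        ],
        answer="white",
    )
    print(example.grounded_boxes())
```

Keeping the box optional per step reflects the idea that some steps localize an object while others only read off a property of an already localized one; whether the real dataset makes that distinction is not stated here.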