Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning

Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, Tat-Seng Chua

TL;DR

This paper introduces a novel bottom-up reasoning framework inspired by human intuition in handling hallucinations. The framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs.

Abstract

Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities across various vision-language tasks. However, MLLMs face significant challenges with hallucinations, i.e., misleading outputs that do not align with the input data. Although existing efforts have sought to combat MLLM hallucinations, several pivotal challenges remain unsolved. First, while current approaches focus heavily on addressing errors at the perception level, another important type at the cognition level, which requires factual commonsense, can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which remains a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and produce hallucinations, yet this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements on multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.

Paper Structure

This paper contains 52 sections, 5 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: On the top, we illustrate three hallucination cases: overgeneralization (case 1), vision object hallucination (case 2), and text object conflict (case 3), where hallucinations are marked in red. On the bottom, we categorize hallucinations into three types: vision, text, and commonsense, ranging from perception to cognition levels.
  • Figure 2: Illustration of the overall framework of Dehall, consisting of six reasoning modules from perception to cognition.
  • Figure 3: The comparison of different CoT mechanisms.
  • Figure 4: Conflict resolution performance with varying numbers of in-context examples.
  • Figure 5: Illustration of example outputs. Cases (a) and (b) show outputs with and without question validation for input questions containing conflicts. Hallucinations are highlighted in red and non-hallucinated content in green. The raw input questions are marked in green, and the adjusted questions in red. Case (c) shows a failure example. For more results, refer to Appendix § F.2.
  • ...and 3 more figures