Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Xiaoye Qu, Jiashuo Sun, Wei Wei, Yu Cheng

TL;DR

This work tackles hallucination in large vision-language systems by introducing MVP, a training-free framework that combines multi-view information seeking with certainty-driven multi-path reasoning to improve decoding reliability. By generating bottom-up, regular, and top-down captions, MVP enriches image understanding, while its path-based certainty aggregation stabilizes output against misleading tokens. Evaluations on POPE and MME across four LVLMs show MVP consistently outperforms vanilla baselines and other training-free methods, with insights from thorough ablations and decoding-strategy analyses. The approach is plug-and-play and adaptable to different decoding strategies, offering a practical pathway to more trustworthy LVLMs without additional training costs.

Abstract

Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination, that is, generating outputs that are inconsistent with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, such retraining inherently incurs additional computational cost. In this paper, we propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of LVLMs via Multi-View Multi-Path Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during answer decoding, we observe that the occurrence of hallucinations correlates strongly with the certainty of the answer tokens. Thus, we propose multi-path reasoning for each information view, which quantifies and aggregates the certainty scores for each potential answer across multiple decoding paths and then decides the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, MVP can effectively reduce hallucinations in LVLMs. Extensive experiments verify that MVP significantly mitigates the hallucination problem across four well-known LVLMs. The source code is available at: https://github.com/GasolSun36/MVP

Paper Structure (24 sections, 7 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Given an image, an LVLM fails to recognize objects or miscounts their quantity.
  • Figure 2: An overview of our Multi-View Multi-Path Reasoning. (1) Seeking image information from multiple perspectives including top-down, regular, and bottom-up views. (2) Augmenting the global vision information with each view information. (3) The certainty-driven decoding corresponding to each view quantifies and aggregates certainty scores for each potential answer among multiple decoding paths. The final results are obtained by comparing certainty scores among all candidates.
  • Figure 3: Comparison of the number of objects mentioned in regular and multi-view captions. The statistics are computed on the MSCOCO Popular part of the POPE benchmark.
  • Figure 4: An illustration of certainty-driven multi-path reasoning. The correct answer is "No". "Score" denotes the certainty score of the answer token. "Yes", "Based", and "The" are the candidate tokens at the first decoding position. The three decoding paths are obtained by greedy decoding from each of these candidate tokens.
  • Figure 5: Results on the full MME set (14 subtasks) for LLaVA-1.5, Qwen-VL, InstructBLIP, and mPLUG-Owl2. The orange lines represent the vanilla model and the blue lines denote our MVP model.
  • ...and 2 more figures
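
The captions of Figures 2 and 4 describe aggregating certainty scores per candidate answer across decoding paths and views, then picking the answer with the highest total. The following is a minimal sketch of that final "compare and decide" step only, not the authors' implementation. It assumes each decoding path has already been reduced to an (answer, certainty score) pair and that aggregation is a plain sum; the page does not specify either detail. The function names, the view names used as dictionary keys, and all scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One decoding path, reduced to (candidate answer, certainty score of its answer token).
# How the certainty score is computed is outside the scope of this sketch.
Path = Tuple[str, float]


def aggregate_certainty(views: Dict[str, Iterable[Path]]) -> Dict[str, float]:
    """Sum certainty scores per candidate answer over all views and paths.

    Summation is an assumption made for illustration; other aggregation
    rules (mean, weighted sum) would fit the same interface.
    """
    totals: Dict[str, float] = defaultdict(float)
    for paths in views.values():
        for answer, score in paths:
            totals[answer] += score
    return dict(totals)


def decide(views: Dict[str, Iterable[Path]]) -> str:
    """Return the candidate answer with the highest aggregated certainty."""
    totals = aggregate_certainty(views)
    if not totals:
        raise ValueError("no decoding paths were provided")
    return max(totals, key=totals.get)


# Hypothetical scores for a yes/no question whose correct answer is "No".
example_views = {
    "bottom_up": [("No", 0.9), ("Yes", 0.2), ("No", 0.6)],
    "regular": [("Yes", 0.5), ("No", 0.4)],
    "top_down": [("No", 0.7), ("Yes", 0.3)],
}

if __name__ == "__main__":
    print(aggregate_certainty(example_views))
    print(decide(example_views))
```

With these made-up scores, a single confident "Yes" path in the regular view is outweighed by the "No" paths accumulated across the other views, which is the stabilizing effect the captions attribute to path-based aggregation.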