Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination

Dingchen Yang, Bowen Cao, Guang Chen, Changjun Jiang

TL;DR

Pensieve is a training-free method that leverages analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. It also helps MLLMs identify visual details and enhances the specificity of generated image descriptions.

Abstract

Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs oblivious to the accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may equally advocate both accurate and erroneous content. To address this issue, we propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. Specifically, Pensieve enables MLLMs to retrospect relevant images as references and compare their visual content with the test image via confidence score subtraction. Moreover, our paradigm balances the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores. Experiments on Whoops, LLaVA Bench, POPE, and MME demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Pensieve also aids MLLMs in identifying visual details and enhances the specificity of generated image descriptions.
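
The confidence score subtraction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate scores, and the fixed weight `alpha` are all assumptions made for illustration. In particular, the paper scales the subtracted scores adaptively, and that rule is not reproduced here. The idea shown is only that candidates supported equally by the test image and by similar reference images lose their advantage after subtraction.

```python
from typing import Dict, List


def contrast_scores(
    test_scores: Dict[str, float],
    reference_scores: List[Dict[str, float]],
    alpha: float = 1.0,
) -> Dict[str, float]:
    """Subtract the mean reference score from each candidate's test-image score.

    `alpha` is a hypothetical fixed weight standing in for the paper's
    adaptive scaling, which is not reproduced in this sketch.
    """
    if not reference_scores:
        # Nothing to compare against: leave the scores unchanged.
        return dict(test_scores)
    adjusted = {}
    for token, score in test_scores.items():
        ref_mean = sum(ref[token] for ref in reference_scores) / len(reference_scores)
        adjusted[token] = score - alpha * ref_mean
    return adjusted


def pick_token(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)


# Made-up scores for three candidate tokens (not taken from the paper).
test_image = {"gray": 5.0, "green": 5.2, "brown": 4.9}
references = [
    {"gray": 1.0, "green": 4.0, "brown": 3.5},
    {"gray": 1.4, "green": 3.6, "brown": 4.1},
]

print(pick_token(test_image))                               # green
print(pick_token(contrast_scores(test_image, references)))  # gray
```

In this toy example the reference images also score "green" and "brown" highly, so subtracting their mean leaves "gray" as the top candidate, since only the test image supports it strongly.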

Paper Structure (56 sections, 14 equations, 15 figures, 8 tables)

Figures (15)

  • Figure 1: Visual hallucinations in image captions and an illustration of our proposed method Pensieve. During inference, MLLMs are enabled to retrospect relevant images as references and compare their visual content with the test image. Pensieve can correct erroneous object categories, attributes, activities, positions, and numbers, wherever they occur in the sentence. Pensieve also helps MLLMs identify visual details in the image (e.g., the traffic light with three green lights).
  • Figure 2: Our visual hallucination analysis pipeline and results. We investigate LLaVA1.5's predictions with alternative visual inputs in the same context, so that the difference between the confidence score distributions of $y_t$ and $\hat{y_t}$ reveals the influence of the visual modality on the MLLM's prediction. We find that LLaVA1.5 is aware of the accurate visual cues amidst hallucination, as the visual information contributes +5.008 to the score of the accurate candidate $\_gray$. However, the visual input also erroneously advocates for the inaccurate candidates $\_green$ (+3.898) and $\_brown$ (+4.250). We also observe that images with similar semantics and appearance can induce analogous visual hallucinations, and we leverage this phenomenon to help MLLMs discern accurate content. This test sample is from the OpenImages validation set \cite{kuznetsova2020open}.
  • Figure 3: Our approach identifies erroneous candidates that are mistakenly supported by the visual branch by leveraging the analogous visual hallucinations among similar images. Our reference database comprises a variety of images. During inference, relevant references are retrieved from this database, and the MLLM generates a distinct prediction for each reference in the same context. The predicted scores are then subtracted to highlight the accurate candidates.
  • Figure 4: Qualitative results on LLaVA-Bench in the wild. Pensieve effectively mitigates visual hallucination for both MLLMs, while VCD and DoLa may induce extra hallucinations. We present the test images with corresponding visual references, which illustrate a similar scenario but exhibit nuanced differences compared to the test image. We omit DoLa's result for LLaVA1.5 as it is identical to the original one.
  • Figure 5: A reference database containing 113k reference images from COCO Caption Karpathy train (T) and restval (V) splits can improve performance across all metrics. to address potential noisy retrieval results. Overall, the performance gain positively correlates with the similarity between the test image and the references. The baseline (depicted as the grey horizontal line) is Exp.1 in Table \ref{tab:whoops_ablation}. "rand" denotes sampling random references (Exp.3).
  • ...and 10 more figures
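
The retrieval step mentioned in the captions of Figures 3 and 5 (fetching references similar to the test image from a database) can be sketched as a nearest-neighbour lookup. The captions do not say how similarity is measured, so everything below is an assumption: the feature vectors, the use of cosine similarity, and the names `cosine_similarity` and `retrieve_references` are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_references(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the names of the k database entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-D image features; real features would come from an image encoder.
database = {
    "street_a": [1.0, 0.0],
    "street_b": [0.9, 0.1],
    "kitchen": [0.0, 1.0],
}
print(retrieve_references([1.0, 0.0], database, k=2))  # ['street_a', 'street_b']
```

Each retrieved reference would then be fed to the MLLM in the same textual context as the test image, producing the per-reference scores that are subtracted from the test-image scores.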