ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models

Yeji Park, Deokyeong Lee, Junsuk Choe, Buru Chang

TL;DR

ConVis tackles hallucinations in multimodal LLMs with a training-free contrastive decoding method that uses a text-to-image (T2I) generator to visualize hallucinations and derive visual contrastive signals. During decoding, the MLLM's caption is fed to a T2I model to produce reconstructed images; the logits obtained from the original image and from the reconstructed images are then contrasted to form a distribution $\hat{f_\theta}$ that penalizes the hallucination-related tokens the reconstructed images tend to amplify. Experiments on five benchmarks (CHAIR, HallusionBench, POPE, MME, LLaVA-Bench) with backbones such as LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2 show reduced hallucinations while preserving language capabilities. The results suggest that higher-quality T2I models and greater caption diversity improve effectiveness, though limitations remain in VQA tasks, where object-specific reasoning may misalign with the reconstructed visuals. The core contribution is a practical decoding strategy, requiring no additional data or model updates, that uses visual contrastive signals to improve the reliability of multimodal responses.

Abstract

Hallucinations in Multimodal Large Language Models (MLLMs), where generated responses fail to accurately reflect the given image, pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.
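
The contrast described in the abstract can be illustrated with a minimal sketch. The exact formula used by ConVis is not reproduced in this summary, so the code below assumes a common contrastive-decoding form: the next-token logits from the original image are shifted by the average difference between them and the logits from each reconstructed image. The function name `contrastive_logits`, the weight `alpha`, and the toy vocabulary and numbers are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def contrastive_logits(logits_original, logits_reconstructed, alpha=1.0):
    """Assumed contrastive form (not taken from the paper):

        adjusted = f_orig + alpha * mean_i(f_orig - f_recon_i)

    logits_original:      shape (V,), logits given the original image
    logits_reconstructed: shape (n, V), logits given each of n T2I images
    alpha:                hypothetical contrast strength
    """
    f_orig = np.asarray(logits_original, dtype=float)
    f_recon = np.atleast_2d(np.asarray(logits_reconstructed, dtype=float))
    # Tokens that the reconstructed images boost get a negative difference,
    # so they are pushed down in the adjusted logits.
    return f_orig + alpha * (f_orig - f_recon).mean(axis=0)


# Toy example: "book" is the hallucinated token, visualized by the T2I model,
# so the reconstructed images raise its logit; "clock" is missing from them.
vocab = ["clock", "book", "table"]
orig = np.array([2.0, 1.5, 1.0])
recon = np.array([[0.5, 3.0, 1.0],
                  [0.7, 2.6, 1.0]])

adjusted = contrastive_logits(orig, recon, alpha=1.0)
p_before = softmax(orig)
p_after = softmax(adjusted)

for token, before, after in zip(vocab, p_before, p_after):
    print(f"{token:>6}: {before:.3f} -> {after:.3f}")
```

In this toy setting the probability of "book" drops and that of "clock" rises after the adjustment, which is the qualitative behavior the abstract describes. In practice the logits would come from separate forward passes of the MLLM on the original and reconstructed images at each decoding step.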

Paper Structure (15 sections, 2 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: The text-to-image model visualizes hallucinations (e.g., 'book') in the semantically reconstructed images based on the hallucinated caption, exhibiting differences (e.g., missing 'clock') from the original image.
  • Figure 2: The original and reconstructed image generate the contrastive logit distribution for the hallucinated tokens (e.g., 'book'). The reconstructed image tends to amplify the logits of tokens corresponding to the visualized hallucination.
  • Figure 3: The original and generated image produce the contrastive distribution for the hallucinated tokens (e.g., 'book'). The generated image tends to amplify the logits of tokens corresponding to the visualized hallucination.
  • Figure 4: Effect of the number of images with different captions.
  • Figure 5: KL divergence between output distributions across each decoding step when the MLLM is provided with the images and caption from Figure 6(a). The KL divergence is significantly elevated for the hallucinated token "car".
  • ...and 1 more figure