ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models
Yeji Park, Deokyeong Lee, Junsuk Choe, Buru Chang
TL;DR
ConVis tackles hallucinations in multimodal LLMs with a training-free contrastive decoding approach that uses a text-to-image (T2I) generator to visualize hallucinations and derive visual contrastive signals. During decoding, the MLLM's caption is fed to a T2I model to produce reconstructed images, and the original and reconstructed images are used to form a contrastive logit distribution $\hat{f_\theta}$ that penalizes hallucination-related tokens. Experiments across five benchmarks (CHAIR, HallusionBench, POPE, MME, LLaVA-Bench) with backbones such as LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2 show reduced hallucinations while preserving language capabilities. The results suggest that higher-quality T2I models and greater caption diversity improve effectiveness, though limitations remain in VQA tasks, where object-specific reasoning may misalign with the reconstructed visuals. The core contribution is a practical decoding strategy that requires no additional data or model updates and leverages visual contrastive signals to improve the reliability of multimodal responses.
Abstract
Hallucinations in Multimodal Large Language Models (MLLMs), where generated responses fail to accurately reflect the given image, pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.
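To make the contrastive step concrete, the following is a minimal sketch of how logits from the original image and from several reconstructed images might be combined at one decoding step. It is an illustration under stated assumptions, not the paper's exact formulation: the combination rule (a generic contrastive form with weight `alpha`), the function names, and the toy two-token vocabulary are all hypothetical. The intuition is that a hallucinated object drawn by the T2I model becomes better supported by the reconstructed images than by the original, so subtracting the reconstructed-image logits lowers its score.

```python
import math


def contrastive_logits(orig_logits, recon_logits_list, alpha=1.0):
    """Combine next-token logits from the original image with logits from
    reconstructed images (assumed rule, not taken from the paper):
        adjusted = (1 + alpha) * orig - alpha * mean(recon)
    """
    n = len(recon_logits_list)
    vocab = len(orig_logits)
    # Average the logits over all reconstructed images, token by token.
    mean_recon = [sum(r[i] for r in recon_logits_list) / n for i in range(vocab)]
    return [(1 + alpha) * o - alpha * m for o, m in zip(orig_logits, mean_recon)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary: index 0 = "dog" (present in the image),
# index 1 = "frisbee" (hallucinated in the caption, hence drawn by the T2I model).
orig = [2.0, 1.5]                   # logits conditioned on the original image
recon = [[2.0, 3.0], [2.0, 2.5]]    # logits conditioned on two reconstructed images

adjusted = contrastive_logits(orig, recon, alpha=1.0)
p_before = softmax(orig)
p_after = softmax(adjusted)
print(adjusted, p_before[1], p_after[1])
```

In this toy case the probability of the hallucinated token drops after the adjustment, while the token grounded in the original image keeps its logit. In a real setting the logits would come from the MLLM itself, conditioned on each image in turn, and `alpha` would be a tuned hyperparameter.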
