Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization
Xinyu Lyu, Beitao Chen, Lianli Gao, Jingkuan Song, Heng Tao Shen
TL;DR
This work tackles object hallucination in Large Vision-Language Models (LVLMs) by introducing Hallucination-Induced Optimization (HIO), which uses a fine-tuned Contrary Bradley-Terry Model to amplify the contrast between hallucinatory and correct tokens. HIO trains an enhanced 'Evil' LVLM to induce multiple hallucinations, then uses that model's logits to sharpen decision boundaries, which strengthens contrastive decoding and substantially reduces hallucinations. Empirical results on the POPE, CHAIR, and MME benchmarks show state-of-the-art or competitive performance, and ablations confirm the contribution of the CBTM, AMTH, and ACI components. The approach offers a theoretically grounded route to more reliable multimodal generation, with practical implications for deployed LVLM systems.
Abstract
Although Large Vision-Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data, they invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information that widens the logit gap between hallucinatory and targeted tokens. However, due to the uncontrollable nature of global visual uncertainty, these methods struggle to induce the hallucinatory tokens precisely, which severely limits their effectiveness in mitigating hallucinations and may even lead to the generation of undesired hallucinations. To tackle this issue, we conduct a theoretical analysis of the conditions under which contrastive decoding is effective. Building on this analysis, we introduce a novel optimization strategy named Hallucination-Induced Optimization (HIO). This strategy seeks to amplify the contrast between hallucinatory and targeted tokens by relying on a fine-tuned theoretical preference model (i.e., the Contrary Bradley-Terry Model), thereby facilitating efficient contrastive decoding to alleviate hallucinations in LVLMs. Extensive experiments demonstrate that our HIO strategy can effectively reduce hallucinations in LVLMs, outperforming state-of-the-art methods across various benchmarks.
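
To make the contrastive decoding idea concrete, the following is a minimal sketch of how logits from a hallucination-prone model might be subtracted from the base model's logits at a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the combination rule (1 + alpha) * base - alpha * induced is the form commonly used in contrastive decoding work and is assumed here, and the vocabulary, logit values, function names, and alpha are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(base_logits, induced_logits, alpha=1.0):
    """Combine per-token logits from a base model and a hallucination-prone model.

    Assumed rule (not taken from the abstract): (1 + alpha) * base - alpha * induced.
    Tokens that the hallucination-prone model favors are pushed down.
    """
    return [(1.0 + alpha) * b - alpha * h
            for b, h in zip(base_logits, induced_logits)]


# Hypothetical three-token vocabulary for one decoding step.
# "dog" is the targeted token; "frisbee" is a hallucinated object.
vocab = ["dog", "frisbee", "cat"]
base_logits = [2.0, 1.9, 0.1]      # base model: narrow gap between dog and frisbee
induced_logits = [0.5, 2.5, 0.1]   # hallucination-prone model: favors frisbee

base_probs = softmax(base_logits)
contrast_probs = softmax(contrastive_logits(base_logits, induced_logits, alpha=1.0))

for token, p_base, p_contrast in zip(vocab, base_probs, contrast_probs):
    print(f"{token:8s} base={p_base:.3f} contrastive={p_contrast:.3f}")
```

In this toy setting the subtraction helps only because the second model assigns a high logit to the hallucinated token. This is the property the abstract says global visual uncertainty fails to guarantee, and that HIO aims to obtain by fine-tuning.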
