Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

Xinyu Lyu, Beitao Chen, Lianli Gao, Jingkuan Song, Heng Tao Shen

TL;DR

This work tackles object hallucination in large vision-language models (LVLMs) by introducing Hallucination-Induced Optimization (HIO), which uses a fine-tuned Contrary Bradley-Terry Model (CBTM) to amplify the contrast between hallucinatory and correct tokens. HIO trains an 'Evil' LVLM to induce multiple hallucinations, then uses that model's logits at decoding time to sharpen the decision boundary, which strengthens contrastive decoding and reduces hallucinations. Results on the POPE, CHAIR, and MME benchmarks show state-of-the-art or competitive performance, and ablations confirm the contributions of the CBTM, AMTH, and ACI components. The approach offers a theoretically grounded route to more reliable multimodal generation, with practical implications for deployed LVLM systems.

Abstract

Although Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data, they invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information that widens the contrastive logits gap between hallucinatory and target tokens. However, due to the uncontrollable nature of global visual uncertainty, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations and may even lead to the generation of undesired hallucinations. To tackle this issue, we conduct a theoretical analysis aimed at improving the effectiveness of contrast decoding. Building on this insight, we introduce a novel optimization strategy named Hallucination-Induced Optimization (HIO). This strategy seeks to amplify the contrast between hallucinatory and target tokens by relying on a fine-tuned theoretical preference model (i.e., the Contrary Bradley-Terry Model), thereby facilitating efficient contrast decoding to alleviate hallucinations in LVLMs. Extensive experiments demonstrate that our HIO strategy effectively reduces hallucinations in LVLMs, outperforming state-of-the-art methods across various benchmarks.

Paper Structure (20 sections, 17 equations, 3 figures, 7 tables, 1 algorithm)

This paper contains 20 sections, 17 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: (Left) Challenges and Solutions of Contrast Decoding Strategy. Visual Contrastive Decoding introduces perturbations to induce hallucinations, but it fails to effectively enlarge the logits gap between hallucinatory and target tokens, resulting in unsatisfactory outputs. In contrast, our method significantly amplifies the logits gap between hallucinatory and target tokens. (Right) The performance of various methods on CHAIR metrics. Our HIO generates descriptions with fewer hallucination tokens than other visual contrastive decoding methods, achieving lower scores on the CHAIRs and CHAIRi metrics.
  • Figure 2: An overview of Hallucination-Induced Optimization (HIO). Our approach comprises two phases: the training stage and inference decoding. During the training stage, given an input image, a query, and a manually annotated correction, the Large Visual Language Model (LVLM) produces multiple instances of hallucinated content. We then apply our Hallucination-Induced Optimization (HIO) method to train an 'Evil' LVLM by inducing hallucinations from the original LVLM. In the inference phase, the logits from the trained 'Evil' LVLM are used to contrast with those generated by the original LVLM, effectively reducing the presence of hallucinations.
  • Figure 3: The logit difference between the hallucination token and the target token. The horizontal axis shows training steps, and the vertical axis shows the logit gap, calculated as the hallucination token's logit minus the target token's logit. ACI visibly widens the gap between the hallucination and target tokens.
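
The inference step described in Figure 2 and the logit gap plotted in Figure 3 can be illustrated with a small sketch. This is not the authors' implementation: the combination rule `(1 + alpha) * base - alpha * evil` is an assumption borrowed from common contrastive decoding formulations, and the names `contrastive_logits`, `logit_gap`, and `alpha`, the three-word vocabulary, and all logit values are hypothetical. Practical methods often also restrict the candidate set to tokens the base model finds plausible, which is omitted here.

```python
def contrastive_logits(base_logits, evil_logits, alpha=1.0):
    """Contrast base-model logits against 'Evil'-model logits.

    Assumed rule (not taken from the paper): (1 + alpha) * base - alpha * evil.
    Tokens that the 'Evil' model favors strongly are pushed down.
    """
    return [(1 + alpha) * b - alpha * e for b, e in zip(base_logits, evil_logits)]


def logit_gap(logits, hallucination_id, target_id):
    """Gap as described for Figure 3: hallucination logit minus target logit."""
    return logits[hallucination_id] - logits[target_id]


def argmax(values):
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical toy vocabulary and logits for a single decoding step.
vocab = ["dog", "cat", "frisbee"]
target_id, hallucination_id = 0, 2

base_logits = [2.0, 0.5, 2.2]  # base model slightly prefers the hallucinated token
evil_logits = [0.0, 0.3, 4.0]  # 'Evil' model strongly prefers the hallucinated token

cd_logits = contrastive_logits(base_logits, evil_logits, alpha=1.0)

base_choice = vocab[argmax(base_logits)]
cd_choice = vocab[argmax(cd_logits)]
evil_gap = logit_gap(evil_logits, hallucination_id, target_id)

print("base choice:", base_choice)
print("contrastive choice:", cd_choice)
print("evil-model logit gap:", evil_gap)
```

In this toy setting, the larger the gap in the 'Evil' model's logits, the more the hallucinated token is penalized after the contrast, which matches the captions' claim that a wider gap makes contrast decoding more effective.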