IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding

Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, Jun Liu

TL;DR

This work tackles hallucinations in large vision-language models (LVLMs) caused by over-reliance on linguistic priors. It introduces Image-Biased Decoding (IBD), a contrastive decoding framework that compares a standard LVLM with an image-biased variant to boost image-consistent tokens while suppressing text-driven errors. The method combines a lightweight image-biased attention mechanism, a contrastive token score, and dynamic adjustment across word types, along with prompt-tuning and an adaptive plausibility constraint. Across multiple LVLMs and evaluation metrics, IBD reduces hallucinations with minimal parameter overhead, demonstrating practical potential for safer, more truthful LVLM outputs.

Abstract

Despite rapid development and widespread application, Large Vision-Language Models (LVLMs) remain prone to generating hallucinations. An over-reliance on linguistic priors has been identified as a key factor leading to these hallucinations. In this paper, we propose to alleviate this problem by introducing a novel image-biased decoding (IBD) technique. Our method derives the next-token probability distribution by contrasting predictions from a conventional LVLM with those of an image-biased LVLM, thereby amplifying the correct information highly correlated with image content while mitigating the hallucinatory errors caused by excessive dependence on text. We further conduct a comprehensive statistical analysis to validate the reliability of our method, and design an adaptive adjustment strategy to achieve robust and flexible handling under varying conditions. Experimental results across multiple evaluation metrics verify that our method, despite requiring no additional training data and only a minimal increase in model parameters, can significantly reduce hallucinations in LVLMs and enhance the truthfulness of generated responses.
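The contrast between a conventional LVLM and an image-biased LVLM can be illustrated with a small sketch. This is not the authors' implementation: it assumes a generic contrastive-decoding score (log-probability under the image-biased model minus log-probability under the conventional model) and a plausibility cutoff applied to the image-biased distribution. The function names, the `beta` parameter, and the choice of which model defines the cutoff are all hypothetical; the paper's exact score and its dynamic adjustment may differ.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token(base_logits, biased_logits, beta=0.1):
    """Pick the token with the highest contrastive score.

    base_logits:   next-token logits from the conventional LVLM (assumed input)
    biased_logits: next-token logits from the image-biased LVLM (assumed input)
    beta:          hypothetical plausibility threshold in (0, 1]
    """
    p_base = softmax(base_logits)
    p_biased = softmax(biased_logits)

    # Plausibility constraint (assumption): keep only tokens whose
    # image-biased probability is at least beta times the maximum.
    cutoff = beta * max(p_biased)

    best_token, best_score = None, -math.inf
    for token, (pb, pi) in enumerate(zip(p_base, p_biased)):
        if pi < cutoff:
            continue
        # Contrastive score: how much more the image-biased model
        # favors this token than the conventional model does.
        score = math.log(pi) - math.log(pb)
        if score > best_score:
            best_token, best_score = token, score
    return best_token, best_score
```

Without the cutoff, a token that both models consider very unlikely could still win purely because its probability ratio is large; the constraint restricts the contrast to tokens the image-biased model already finds plausible.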

Paper Structure

This paper contains 18 sections, 9 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: An illustrative example of our method. Texts highlighted in red and green indicate erroneous prediction and correct prediction generated by the original LVLM and contrastive results, respectively.
  • Figure 2: (a) and (b) respectively present the statistical results for content words and function words in the COCO Caption dataset. Dark bars represent the proportion of ground-truth tokens with the highest CD score among all candidate tokens, while light bars represent the proportion of ground-truth tokens without the highest CD score. Results are reported for four LVLMs: InstructBLIP, MiniGPT-4, LLaVA-1.5 and Shikra.
  • Figure 3: Statistical results illustrating the relationship between the prediction similarity of $\theta$ and $\hat{\theta}$ and the proportion of ground-truth tokens having the maximum CD score among candidate tokens. The x-axis denotes the range of the Jensen-Shannon divergence (JSD) $d_{i}$ between the predictions of $\theta$ and $\hat{\theta}$, where $d_{i}$ is scaled by $1.5\times10^{4}$. A higher $d_{i}$ indicates lower similarity. The y-axis shows, among all time steps whose $d_{i}$ falls into each range, the proportion of time steps where the ground-truth token has the highest CD score among all candidate tokens.
  • Figure 4: An example to show the problem of image-biased hallucinations in LVLMs. Texts highlighted in red and green indicate erroneous information and correct information generated by LLaVA-1.5, respectively.
  • Figure 5: Evaluation results assisted by GPT-4, including 4 metrics: the number of hallucinated sentences per image (HSPI), the number of hallucinated words per image (HWPI), the ratio of hallucinated sentences (HSR), and the ratio of hallucinated words (HWR). Lower values indicate fewer hallucinations.
  • ...and 4 more figures
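
The Figure 3 caption measures the similarity between the predictions of $\theta$ and $\hat{\theta}$ with the Jensen-Shannon divergence. A minimal sketch of that quantity for two discrete next-token distributions follows. The function names are hypothetical, the natural logarithm is an assumption (the source does not state the base), and the caption's $1.5\times10^{4}$ scaling is not applied here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With the natural logarithm the value lies between 0 (identical distributions) and $\ln 2$ (distributions with disjoint support), so a larger value means the two models disagree more at that time step.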