Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Avshalom Manevich, Reut Tsarfaty

TL;DR

The paper tackles object hallucinations in large vision-language models (LVLMs) by introducing Language Contrastive Decoding (LCD), a decoding-time method that contrasts the LVLM's next-token distribution with that of an auxiliary language model conditioned on the text alone, thereby suppressing language biases. LCD reweights next-token probabilities during generation with an entropy-weighted contrastive formula, so the LVLM produces more faithful descriptions without retraining. Empirical results on POPE and on detailed image descriptions show that LCD reduces hallucinations across multiple LVLM architectures while preserving or improving caption quality, with notable gains for InstructBLIP variants; GPT-4V assessments also indicate higher accuracy. The work shows that LVLM-specific decoding strategies are practical and effective, and it motivates further exploration of decoding-time techniques for improving the reliability of multimodal models.
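To make the contrast step concrete, here is a minimal sketch of one decoding step in Python with NumPy. It is not the authors' implementation. The function name `lcd_step`, the parameters `beta` and `plausibility`, and the specific weighting `alpha = beta / entropy` are assumptions chosen for illustration; the summary above only states that the contrast is entropy-weighted, so the exact formula in the paper may differ. The plausibility cutoff is borrowed from earlier contrastive-decoding work and is likewise an assumption here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def entropy(log_probs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    p = np.exp(log_probs)
    return float(-(p * log_probs).sum())

def lcd_step(lvlm_logits, llm_logits, beta=0.5, plausibility=0.1):
    """One hypothetical language-contrastive decoding step.

    lvlm_logits: next-token logits from the LVLM (image + text prefix).
    llm_logits:  next-token logits from a text-only LLM (text prefix only).
    Returns the reweighted next-token probability distribution.
    """
    lvlm_lp = log_softmax(lvlm_logits)
    llm_lp = log_softmax(llm_logits)

    # ASSUMPTION: the contrast weight grows as the LLM becomes more
    # confident, i.e. as the entropy of its distribution shrinks.
    alpha = beta / (entropy(llm_lp) + 1e-6)

    # Contrast: boost tokens the LVLM likes more than the text-only LLM does.
    scores = (1.0 + alpha) * lvlm_lp - alpha * llm_lp

    # ASSUMPTION: plausibility cutoff, keeping only tokens whose LVLM
    # probability is at least `plausibility` times the LVLM's top probability.
    keep = lvlm_lp >= np.log(plausibility) + lvlm_lp.max()
    scores = np.where(keep, scores, -np.inf)

    return np.exp(log_softmax(scores))

# Toy three-token vocabulary with made-up probabilities.
vocab = ["dog", "cat", "other"]
lvlm_logits = np.log([0.45, 0.40, 0.15])  # LVLM leans slightly toward "dog"
llm_logits = np.log([0.90, 0.05, 0.05])   # text-only LLM strongly expects "dog"

lcd_probs = lcd_step(lvlm_logits, llm_logits)
print(vocab[int(np.argmax(lvlm_logits))], "->", vocab[int(np.argmax(lcd_probs))])
```

In this toy case the LVLM alone would pick "dog", the token the text-only model is most confident about, while the contrasted distribution prefers "cat". The numbers are invented; the point is only that a token favored mainly by the language prior is penalized.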

Abstract

Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to 4% improvement in POPE F1 scores and up to 36% reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms.
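For readers unfamiliar with the two metrics named in the abstract, the sketch below shows how they are commonly computed. It assumes the standard definitions: CHAIR reports the fraction of mentioned objects that are not in the image (instance level) and the fraction of captions containing at least one such object (sentence level), while POPE asks yes/no questions about object presence and is scored with F1 on the "yes" class. The function names and input shapes are hypothetical, and the sketch assumes object mentions have already been extracted from the captions, which the real CHAIR pipeline does with COCO category synonyms.

```python
def chair_scores(mentioned_objects, present_objects):
    """Instance-level and sentence-level CHAIR.

    mentioned_objects: list of sets, objects named in each generated caption.
    present_objects:   list of sets, objects actually in each image.
    Returns (chair_i, chair_s); lower is better for both.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_objects, present_objects):
        hallucinated = mentioned - present  # named in caption, absent from image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += bool(hallucinated)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = (captions_with_hallucination / len(mentioned_objects)
               if mentioned_objects else 0.0)
    return chair_i, chair_s

def f1_yes_no(predictions, labels):
    """F1 on the 'yes' class for POPE-style yes/no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: the first caption names a frisbee that is not in the image.
chair_i, chair_s = chair_scores(
    [{"dog", "frisbee"}, {"cat"}],
    [{"dog"}, {"cat", "sofa"}],
)
print(chair_i, chair_s)
print(f1_yes_no(["yes", "yes", "no", "no"], ["yes", "no", "no", "yes"]))
```

Here one of three mentioned objects is hallucinated and one of two captions contains a hallucination, so the instance-level score is one third and the sentence-level score is one half.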

Paper Structure (21 sections, 5 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: An illustration of LLM vs. LVLM token probabilities given an image and a text prefix mid-generation. According to the LLM, the word "dog" is much more likely to appear next. LCD dynamically contrasts these probabilities to mitigate language biases in LVLM outputs.
  • Figure 2: Prompt used to evaluate descriptions with GPT-4V, taken from Woodpecker (Yin et al., 2023).
  • Figure: COCO Image 461331
  • Figure: COCO Image 498100
  • Figure: COCO Image 379404