Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

Xintong Wang; Jingheng Pan; Liang Ding; Chris Biemann

TL;DR

This work tackles hallucinations in large vision-language models by introducing Instruction Contrastive Decoding (ICD), a training-free, LVLM-agnostic inference technique that contrasts standard instructions with disturbance-instructed variants to suppress hallucinated concepts. ICD uses a highlight-then-detach contrastive objective and adaptive plausibility constraints to reduce object- and attribute-level hallucinations while preserving or enhancing general perception tasks. Extensive evaluation on POPE, MME, and LLaVa-Bench across multiple backbones demonstrates substantial improvements over baseline decoding and a prior visual-contrastive method, highlighting ICD's effectiveness and versatility. The results suggest ICD as a practical, deployment-friendly strategy to improve the reliability of multimodal AI systems, with potential for integration with complementary approaches for further gains.

Abstract

Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where the generated text inaccurately represents the visual content. To address this issue, this paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules. ICD contrasts the output distributions obtained under the standard instruction and under a disturbance instruction; the disturbance increases alignment uncertainty, so that the hallucinated concepts it amplifies can be subtracted from the original distribution. Through comprehensive experiments on discriminative benchmarks (POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that ICD significantly mitigates both object-level and attribute-level hallucinations. Moreover, our method not only addresses hallucinations but also significantly enhances the general perception and recognition capabilities of LVLMs.
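The contrast described in the abstract can be illustrated with a minimal sketch of a single decoding step. This page does not reproduce the paper's equations, so the form below is an assumption borrowed from the common contrastive decoding recipe: logits are combined as (1 + lam) * standard - lam * disturbance, and an adaptive plausibility constraint keeps only tokens whose standard-instruction probability is at least beta times the maximum. The names `contrastive_decode_step`, `lam`, and `beta`, and the toy logits, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_decode_step(std_logits, dist_logits, lam=1.0, beta=0.1):
    """Return a next-token distribution for one contrastive decoding step.

    std_logits:  logits produced under the standard instruction.
    dist_logits: logits produced under the disturbance instruction.
    lam, beta:   hypothetical hyperparameters (contrast strength, plausibility cutoff).
    """
    p_std = softmax(std_logits)
    # Assumed plausibility constraint: drop tokens that the standard
    # distribution itself considers very unlikely.
    cutoff = beta * max(p_std)
    contrasted = [
        (1 + lam) * s - lam * d if p >= cutoff else -math.inf
        for s, d, p in zip(std_logits, dist_logits, p_std)
    ]
    return softmax(contrasted)


# Toy vocabulary mirroring the Figure 1 example; the numbers are invented.
vocab = ["dog", "person", "fork"]
std_logits = [2.0, 1.8, 0.5]   # standard instruction: "dog" only slightly ahead
dist_logits = [1.0, 2.5, 1.5]  # disturbance instruction: boosts "person" and "fork"

probs = contrastive_decode_step(std_logits, dist_logits)
best = vocab[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

In this toy case the disturbance raises the scores of "person" and "fork", so subtracting it widens the margin of "dog" relative to decoding from the standard logits alone. Raising `beta` prunes more low-probability tokens before the contrast is applied.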

Paper Structure (24 sections, 7 equations, 7 figures, 6 tables)

Introduction
Related Work
Large Vision-Language Models.
Hallucination in VLMs.
Method
Inference in LVLMs
Instruction Can Amplify Hallucination
Instruction Contrastive Decoding
Contrastive Decoding with Disturbance
Adaptive Plausibility Constraints
Experiment
Experimental Settings
Datasets and Evaluation Metrics
LVLM Baselines
Experimental Results
...and 9 more sections

Figures (7)

Figure 1: An illustration of the inference framework and contrastive decoding process of the ICD method. At the core (middle orange box), the framework integrates a frozen image encoder, LLM, and query vectors (gray box) within the Q-Former, and adjusts only the standard and disturbance instructions. The latter, exemplified by adding role prefixes such as 'You are a confused object detector,' aims to increase multimodal alignment uncertainty. This results in two distinct distributions: one from the standard instruction and another influenced by the disturbance. The contrastive decoding method (right orange box) highlights how disturbance instructions amplify hallucinated concepts ('person' and 'fork'), which are then corrected by subtracting them from the probabilities derived from the standard instruction, ensuring accurate recognition of the correct concept 'dog'.
Figure 2: The left figure shows the hallucination ratio of the most frequent objects, and the right figure depicts the ratio of object hallucinations co-occurring with 'dining table'.
Figure 3: Performance on the full MME benchmark. The left figure (purple) shows the results based on MiniGPT-4, while the right figure (blue) shows the results based on InstructBLIP.
Figure 4: Performance of the VCD-enhanced ICD method on the MME subset. The underlying LVLM is InstructBLIP.
Figure 5: Qualitative analysis on LLaVa-Bench. The left figure highlights the statistical bias, and the right figure shows the language prior that contributes to hallucinations in LVLMs. Hallucinated concepts are highlighted in red.
...and 2 more figures
