Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Jihoon Lee, Min Song

TL;DR

RVCD targets Object Hallucination (OH) in Large Vision-Language Models with a training-free, plug-and-play decoding method. At each decoding step it uses negative and positive logits, $N_t$ and $P_t$, derived from explicit single-concept reference images. The references are retrieved from a CHAIR-aligned database, and a logit adjustment with parameters $\alpha$ and $\beta$ produces $f_{adjusted}$, which guides the final token selection. Across MSCOCO and related benchmarks (CHAIR, BLEU, POPE, MME, LLaVA-Bench), RVCD achieves substantial OH reduction while preserving caption quality and keeping decoding latency competitive with or faster than prior SOTA decoding methods. The default settings $(\alpha, \beta)=(1,0.1)$ balance hallucination suppression with ground-truth recovery, and improvements in object-detection accuracy further boost performance.
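
The summary above names the ingredients of the adjustment ($N_t$, $P_t$, $\alpha$, $\beta$, $f_{adjusted}$) but does not state the equation. The sketch below shows one plausible form, modeled on the generic contrastive-decoding pattern. It is an assumption for illustration, not the paper's formula: the function names `adjust_logits` and `greedy_token` and the toy logits are hypothetical, and only the default $(\alpha, \beta)=(1, 0.1)$ comes from the text.

```python
from typing import List, Sequence


def adjust_logits(
    base: Sequence[float],
    negatives: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    alpha: float = 1.0,
    beta: float = 0.1,
) -> List[float]:
    """Hypothetical contrastive adjustment over one decoding step's logits.

    ASSUMPTION: each negative reference pushes the result away from its logits
    by alpha * (base - neg), and each positive reference pulls it toward its
    logits by beta * (pos - base). The paper's actual equation may differ.
    """
    adjusted = list(base)
    for neg in negatives:
        for i, (b, n) in enumerate(zip(base, neg)):
            adjusted[i] += alpha * (b - n)
    for pos in positives:
        for i, (b, p) in enumerate(zip(base, pos)):
            adjusted[i] += beta * (p - b)
    return adjusted


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the largest logit (greedy selection)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy 3-token vocabulary (made-up numbers):
# index 0 = an object really in the image, index 1 = a hallucinated object.
base = [2.0, 2.5, 0.0]      # logits from the original image
n_t = [[0.0, 3.0, 0.0]]     # logits from one negative reference image
p_t = [[4.0, 0.0, 0.0]]     # logits from one positive reference image
adjusted = adjust_logits(base, n_t, p_t)
```

In this toy case the greedy choice moves from index 1 (the hallucinated object) on the raw logits to index 0 after adjustment, which is the qualitative behavior the TL;DR describes.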

Abstract

Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.

Paper Structure

This paper contains 28 sections, 14 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: Detection precision for YOLO and LVLM detectors on MSCOCO Validation 2014 [lin2014mscoco]. Hal ($\cdot$) shows, for objects in greedy-decoded captions that YOLO and LVLM VQA flagged as hallucinated, the proportion that were true hallucinations. GT ($\cdot$) shows the proportion of objects that YOLO and the LVLMs correctly identified as existing. While both perform similarly in detecting existing objects, YOLO excels in hallucination detection, motivating us to transfer YOLO’s strength to LVLMs for correcting hallucinated objects. The statistical details are provided in the appendix.
  • Figure 2: Overall pipeline of our RVCD. $x$ denotes the input prompt, and $v$ denotes the input image. $n_{vi}$ and $p_{vi}$ are images retrieved from the image database, representing single-concept images for objects identified as hallucinations (appearing only in the draft caption) and ground truth (appearing in both the OD model and draft caption), respectively. $N_t$ and $P_t$ represent the sets of logits generated from $n_{vi}$ and $p_{vi}$, respectively. At each decoding step, the LVLM processes $x$, $v$, $N_t$, $P_t$, and ongoing output tokens $y_{<t}$, which are then integrated according to our proposed formula. This iterative decoding process produces the final caption of RVCD.
  • Figure 3: AI-generated single-concept image DB. We adopted FLUX.1-dev [yang2024flux] to generate $336 \times 336$-pixel images, each representing only the single concept corresponding to one word in the MSCOCO object synonyms dictionary, and stored them in an image database. An image was stored only if both the LVLM’s output caption and the image generation model’s input prompt mentioned the corresponding concept. Otherwise, the hyperparameters of the image generation model were adjusted and the image was regenerated. This process was repeated until an image had been generated for every word in the dictionary.
  • Figure 4: Top-5 token probabilities for each single-concept image. When an LVLM is asked to respond to an image with a single word, its top-5 tokens frequently include tokens for other objects registered in the MSCOCO dictionary, even for single-concept images. Details are given in the appendix.
  • Figure 5: Comparison of different decoding baselines on the MME metric with LLaVA-1.5 as the backbone LVLM. Refer to the detailed MME results table for further information, including MiniGPT-4 and mPLUG-Owl2.
  • ...and 5 more figures
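
The Figure 2 caption describes how objects in the draft caption are split before retrieval: objects that appear only in the draft caption are treated as hallucinations (negative references), and objects that appear in both the object-detection output and the draft caption are treated as ground truth (positive references). A minimal sketch of that split follows. The function name `split_draft_objects` and the example object lists are hypothetical, and the sketch assumes both inputs have already been normalized to the same object vocabulary.

```python
from typing import Iterable, List, Tuple


def split_draft_objects(
    draft_objects: Iterable[str],
    detected_objects: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split draft-caption objects into (negatives, positives).

    negatives: mentioned in the draft caption but not found by the detector.
    positives: mentioned in the draft caption and found by the detector.
    ASSUMPTION: both inputs use the same normalized object names.
    """
    draft = set(draft_objects)
    detected = set(detected_objects)
    negatives = sorted(draft - detected)   # candidates for n_vi retrieval
    positives = sorted(draft & detected)   # candidates for p_vi retrieval
    return negatives, positives


# Made-up example: the draft caption mentions a frisbee the detector did not find.
negatives, positives = split_draft_objects(
    draft_objects=["dog", "frisbee", "bench"],
    detected_objects=["dog", "bench", "tree"],
)
```

Objects seen only by the detector (here, "tree") fall into neither group, which matches the caption's definitions: both groups are subsets of the draft caption's objects.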