Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models
Jihoon Lee, Min Song
TL;DR
RVCD targets Object Hallucination (OH) in Large Vision-Language Models with a training-free, plug-and-play decoding method. At each decoding step $t$, it derives negative and positive logits, $N_t$ and $P_t$, from explicit single-concept reference images. These references are retrieved from a CHAIR-aligned database, and a logit adjustment with parameters $\alpha$ and $\beta$ produces $f_{adjusted}$, which guides the final token selection. On MSCOCO and related benchmarks (CHAIR, BLEU, POPE, MME, LLaVA-Bench), RVCD substantially reduces OH while preserving caption quality, with decoding latency competitive with or faster than prior SOTA decoding methods. The default setting $(\alpha, \beta)=(1,0.1)$ balances hallucination suppression with ground-truth recovery, and improvements in object-detection accuracy further boost performance.
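To make the logit adjustment concrete, here is a minimal sketch of one decoding step. The summary above does not give the formula, so the form used here is an assumption for illustration only: $f_{adjusted} = f - \alpha \cdot \mathrm{mean}(N_t) + \beta \cdot \mathrm{mean}(P_t)$, with $\alpha$ scaling the negative logits and $\beta$ the positive ones. The function names `adjust_logits` and `greedy_token`, the averaging over reference images, and the toy three-token vocabulary are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def adjust_logits(
    original: Sequence[float],
    negatives: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    alpha: float = 1.0,
    beta: float = 0.1,
) -> List[float]:
    """Assumed adjustment: f - alpha * mean(N_t) + beta * mean(P_t).

    `original` holds the logits for the input image at step t.
    `negatives` and `positives` hold one logit vector per retrieved
    single-concept reference image. The exact combination rule is an
    assumption, not the paper's stated formula.
    """
    vocab = len(original)

    def mean_over(refs: Sequence[Sequence[float]]) -> List[float]:
        # With no reference images, the term contributes nothing.
        if not refs:
            return [0.0] * vocab
        return [sum(r[i] for r in refs) / len(refs) for i in range(vocab)]

    n_mean = mean_over(negatives)
    p_mean = mean_over(positives)
    return [
        original[i] - alpha * n_mean[i] + beta * p_mean[i]
        for i in range(vocab)
    ]


def greedy_token(logits: Sequence[float]) -> int:
    """Return the index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary: index 0 = "dog", 1 = "cat", 2 = "table".
original = [2.0, 2.5, 1.0]          # the model slightly prefers "cat"
negatives = [[0.0, 3.0, 0.0]]       # reference image of the suspected object
positives = [[3.0, 0.0, 0.0]]       # reference image of a confirmed object

adjusted = adjust_logits(original, negatives, positives)
chosen = greedy_token(adjusted)
```

In this toy case the unadjusted logits would select "cat", while the adjusted logits select "dog": the negative reference lowers the logit of the suspected hallucination and the positive reference slightly raises the confirmed object. With the default $(\alpha, \beta)=(1,0.1)$ the suppression term is weighted ten times more strongly than the recovery term.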
Abstract
Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
