Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Jihoon Lee; Min Song

TL;DR

RVCD targets Object Hallucination (OH) in Large Vision-Language Models with a training-free, plug-and-play decoding method. At each decoding step it computes negative and positive logits, $N_t$ and $P_t$, from explicit single-concept reference images. The references are retrieved from a CHAIR-aligned database, and a logit adjustment with parameters $\alpha$ and $\beta$ produces $f_{adjusted}$, which guides the final token selection. Across MSCOCO and related benchmarks (CHAIR, BLEU, POPE, MME, LLaVA-Bench), RVCD achieves substantial OH reduction while preserving caption quality and keeping decoding latency competitive with or faster than prior SOTA decoding methods. The default settings $(\alpha, \beta)=(1,0.1)$ balance hallucination suppression with ground-truth recovery, and improvements in object-detection accuracy further boost performance.
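To make the role of $\alpha$ and $\beta$ concrete, here is a minimal sketch of one decoding step. The summary above does not state the exact adjustment formula, so the linear form used here (subtract $\alpha N_t$, add $\beta P_t$) is an assumption for illustration only. The function names, the three-word vocabulary, and the logit values are likewise hypothetical, and $N_t$ and $P_t$ are treated as already-aggregated per-token logit vectors.

```python
from typing import List, Sequence


def adjust_logits(
    f_original: Sequence[float],
    n_t: Sequence[float],
    p_t: Sequence[float],
    alpha: float = 1.0,
    beta: float = 0.1,
) -> List[float]:
    """Return adjusted logits under an ASSUMED linear rule.

    Assumption (not taken from the source): f - alpha * N_t + beta * P_t.
    The defaults mirror the (alpha, beta) = (1, 0.1) setting quoted above.
    """
    if not (len(f_original) == len(n_t) == len(p_t)):
        raise ValueError("all logit vectors must share the vocabulary size")
    return [f - alpha * n + beta * p for f, n, p in zip(f_original, n_t, p_t)]


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical toy vocabulary and logits for a single decoding step.
vocab = ["dog", "frisbee", "grass"]
f_t = [2.0, 2.5, 1.0]      # original logits: "frisbee" would be chosen
n_t = [0.0, 1.5, 0.0]      # negative-reference logits penalise "frisbee"
p_t = [3.0, 0.0, 0.5]      # positive-reference logits support "dog"

f_adjusted = adjust_logits(f_t, n_t, p_t)
chosen = vocab[greedy_token(f_adjusted)]
```

In this toy setup the unadjusted logits pick "frisbee", while the adjusted logits pick "dog". With the small default $\beta$, the positive term only nudges the scores, whereas the negative term at $\alpha = 1$ is applied at full strength.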

Abstract

Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
