Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
Hongseok Oh, Wonseok Hwang
TL;DR
The paper challenges the view that vision-encoder capacity is the main cause of object hallucination in LVLMs by examining a discriminative, retrieval-style evaluation (OHD-Caps). It introduces Fine-grained CLIPScore (F-CLIPScore), which augments sentence-level CLIPScore with text embeddings of the individual nouns in the caption, extracted with spaCy, and improves accuracy on OHD-Caps by 39.6 percentage points over CLIPScore without any training. The authors further show that F-CLIPScore-based data filtering during LVLM pretraining reduces object hallucination, yielding a 4.9% gain in POPE accuracy, which underscores the role of data quality and evaluation metrics. Overall, the work suggests that hallucinations arise from factors beyond vision-encoder capacity and provides a practical, lightweight tool for both evaluation and data curation in vision-language systems.
Abstract
Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. In this work, we study object hallucination primarily in a discriminative, retrieval-style evaluation setting (OHD-Caps) rather than in free-form caption generation. We revisit the previous claim that such hallucinations stem from the limited representational capacity of the vision encoder. Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (a 4.9% gain in POPE accuracy after alignment pretraining). Our code is publicly available at https://github.com/abzb1/f-clip
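To make the noun-level idea concrete, the following is a minimal sketch of how a score of this kind could be computed from precomputed embeddings. It is an illustration, not the authors' implementation: the abstract does not state how the sentence-level and noun-level similarities are combined, so the unweighted mean used here is an assumption, as are the function names `f_clipscore` and `pick_caption`. The CLIP image and text encoders and the spaCy noun extraction are left out; the inputs are assumed to be embedding vectors already produced by those steps, and the rescaling constant of the original CLIPScore is omitted.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def f_clipscore(image_emb, sentence_emb, noun_embs):
    """Hypothetical fine-grained score.

    Averages the image-sentence similarity with the image-noun
    similarities. The unweighted mean is an assumption made for this
    sketch; the paper's aggregation may differ.
    """
    sims = [cosine(image_emb, sentence_emb)]
    sims += [cosine(image_emb, n) for n in noun_embs]
    return sum(sims) / len(sims)


def pick_caption(image_emb, candidates):
    """Retrieval-style selection: return the index of the best caption.

    `candidates` is a list of (sentence_emb, noun_embs) pairs, one per
    candidate caption.
    """
    scores = [f_clipscore(image_emb, s, n) for s, n in candidates]
    return int(np.argmax(scores))


# Toy 2-D example: both captions look identical at the sentence level,
# but the second one mentions a noun that is orthogonal to the image.
image = np.array([1.0, 0.0])
faithful = ([1.0, 0.0], [[1.0, 0.0]])
hallucinated = ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
best = pick_caption(image, [faithful, hallucinated])
```

In this toy setting a sentence-level score alone cannot separate the two captions, while the noun-level terms lower the score of the caption containing the unsupported object. The same score could in principle be thresholded to drop low-scoring image-caption pairs from a pretraining set; the threshold would be a tunable choice that is not specified here.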
