Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore

Hongseok Oh, Wonseok Hwang

TL;DR

The paper challenges the view that vision-encoder capacity is the main cause of object hallucination in LVLMs by examining discriminative, retrieval-style evaluations (OHD-Caps). It introduces Fine-grained CLIPScore (F-CLIPScore), which augments sentence-level CLIPScore with noun-level text embeddings via spaCy, achieving a +39.6 percentage-point improvement on OHD-Caps without any training. The authors further show that F-CLIPScore-based data filtering during LVLM pretraining reduces object hallucination, yielding a +4.9% POPE accuracy gain, underscoring the role of data quality and evaluation metrics. Overall, the work suggests that hallucinations arise from factors beyond vision-encoder capacity and provides a practical, lightweight tool for both evaluation and data curation in vision-language systems.

Abstract

Large Vision-Language Models (LVLMs) have recently shown remarkable performance across various domains. However, these models suffer from object hallucination. In this work, we study object hallucination primarily in a discriminative, retrieval-style evaluation setting (OHD-Caps), rather than in free-form caption generation. This study revisits the previous claim that such hallucinations are caused by the limited representational capacity of the vision encoder. Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6 percentage points without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (a 4.9% gain in POPE accuracy after alignment pretraining). Our code is publicly available at https://github.com/abzb1/f-clip
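
The idea of combining a sentence-level score with noun-level scores can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image, sentence, and noun embeddings have already been computed (by CLIP, with nouns extracted by spaCy as the summary describes), it uses raw cosine similarity without the rescaling that CLIPScore normally applies, and it assumes the scores are combined by a plain mean. The names `cosine` and `f_clipscore` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def f_clipscore(image_emb: Vector, sentence_emb: Vector,
                noun_embs: Sequence[Vector]) -> float:
    """Mean of the sentence-level and noun-level image-text similarities.

    Assumption: a plain average over the sentence and every noun.
    With no nouns, this reduces to the sentence-level similarity.
    """
    sims = [cosine(image_emb, sentence_emb)]
    sims.extend(cosine(image_emb, noun) for noun in noun_embs)
    return sum(sims) / len(sims)


# Toy 2-D embeddings (made up for illustration, not real CLIP outputs).
image = [1.0, 0.0]
sentence = [0.8, 0.6]              # same sentence-level similarity for both captions
grounded_noun = [1.0, 0.0]         # a noun that matches the image
hallucinated_noun = [0.0, 1.0]     # a noun that does not match the image

faithful_score = f_clipscore(image, sentence, [grounded_noun])
hallucinated_score = f_clipscore(image, sentence, [grounded_noun, hallucinated_noun])
```

In this toy setup both captions have the same sentence-level similarity, so a sentence-only score cannot separate them, while the noun-level term pulls down the caption that mentions an object absent from the image.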

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: A representative example from the OHD-Caps test set is shown. The original CLIP selects a sentence mentioning “children” and “tennis” but adds hallucinated objects. The OHD-Caps-trained CLIP hallucinates “dog” and “frisbee” without introducing new content. In contrast, F-CLIPScore selects a sentence that preserves the original meaning without hallucinations.
  • Figure 2: Histograms of cosine similarity between two embedding vectors: one from the original CLIP-L and the other from the OHD-Caps-trained CLIP-L. (a) Histogram for the vision encoders. Correct (blue) marks examples where the original CLIP-L predicts the ground truth; all other examples are shown in orange. (b) Histogram for the text encoders, using the same color scheme and measured only on ground-truth text. (c) Distribution of cosine similarity between text embeddings of text without object hallucination (purple) and text with object hallucination (green), for all samples.
  • Figure 3: The graphical representation of F-CLIPScore.
  • Figure 4: Ten randomly sampled image-caption pairs each from (I) samples filtered by CLIPScore and (II) samples filtered by F-CLIPScore at the 30% filtering rate. Images overlapping between the two metrics were excluded.
  • Figure 5: The figure shows the top and bottom 10 samples from the entire LLaVA-Pretrain dataset, sorted by the difference between the F-CLIPScore rank and the CLIPScore rank. (I) represents samples that CLIPScore rated higher than F-CLIPScore, while (II) represents the opposite cases. Each caption is written below its corresponding image.
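
Score-based filtering at a given rate, as referenced in the Figure 4 caption, can be sketched as follows. This is an illustration only: it assumes that a 30% filtering rate means discarding the 30% of image-caption pairs with the lowest scores, which the text here does not spell out, and the function name `filter_by_score` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def filter_by_score(samples: Sequence[T], scores: Sequence[float],
                    drop_fraction: float = 0.3) -> List[T]:
    """Drop the lowest-scoring fraction of samples, keeping the original order.

    Assumption: "filtering rate" is the share of samples removed,
    counted from the bottom of the score ranking.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")

    # Number of samples to remove (the product is truncated to an integer).
    n_drop = int(len(samples) * drop_fraction)

    # Indices sorted from lowest to highest score.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i])
    dropped = set(ranked[:n_drop])

    return [s for i, s in enumerate(samples) if i not in dropped]


# Made-up captions and scores, for illustration only.
captions = list("abcdefghij")
caption_scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
kept = filter_by_score(captions, caption_scores, drop_fraction=0.3)
```

The scores passed in could come from CLIPScore or F-CLIPScore; comparing the two resulting subsets is the kind of contrast that Figures 4 and 5 describe.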