
Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Yufang Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Aimin Zhou

TL;DR

The counterfactual data augmentation method is shown to effectively mitigate object hallucinations in the CLIP model, and the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We find that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between the vision and language modalities. To address this, we propose a counterfactual data augmentation method that creates negative samples covering a variety of hallucination issues. We demonstrate that our method can effectively mitigate object hallucinations in the CLIP model, and we show that the enhanced model can be employed as a visual encoder, effectively alleviating the object hallucination issue in LVLMs.

Paper Structure (29 sections, 8 equations, 3 figures, 13 tables)

Figures (3)

  • Figure 1: The pipeline of our benchmark creation process. For an image, we first use SEEM (Zou et al., 2023) to identify the objects within the image, and then obtain illusory objects that do not exist in the picture through different sampling strategies. We then ask GPT to insert or delete objects in the original sentences to create negative samples. We provide both positive and negative samples to the CLIP model to observe whether the model assigns the positive sample the highest score. This image is from the NoCaps dataset, and the model is CLIP ViT-B/32.
  • Figure 2: The performance of the model on the OHD-Caps dataset when trained with different amounts of training data. We report the average results over three random seeds.
  • Figure 3: Examples from our benchmark OHD-Caps. The three images in the figure are from the COCO, Flickr, and NoCaps datasets, respectively.
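
The check described in the Figure 1 caption (does the positive caption receive the highest score among all candidates?) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes the image and caption embeddings have already been computed by some CLIP-style encoder, uses cosine similarity as the score, and the names `cosine`, `positive_ranked_first`, and `accuracy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def positive_ranked_first(
    image_emb: Vector, positive_emb: Vector, negative_embs: List[Vector]
) -> bool:
    """True if the positive caption scores strictly higher than every negative."""
    pos_score = cosine(image_emb, positive_emb)
    return all(pos_score > cosine(image_emb, neg) for neg in negative_embs)


def accuracy(examples: List[Tuple[Vector, Vector, List[Vector]]]) -> float:
    """Fraction of examples in which the positive caption is ranked first."""
    hits = sum(positive_ranked_first(img, pos, negs) for img, pos, negs in examples)
    return hits / len(examples)


# Toy embeddings (assumed values, for illustration only).
examples = [
    # Positive caption is closest to the image: counted as correct.
    ([1.0, 0.0], [0.9, 0.1], [[0.5, 0.5], [0.0, 1.0]]),
    # A hallucinated (negative) caption is closer: counted as an error.
    ([1.0, 0.0], [0.5, 0.5], [[0.9, 0.1]]),
]
print(accuracy(examples))
```

In a real run, the embeddings would come from the image and text towers of the model under test (for example CLIP ViT-B/32, as in Figure 1), and each example would pair one image with its original caption and the generated negative captions.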