Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Yufang Liu; Tao Ji; Changzhi Sun; Yuanbin Wu; Aimin Zhou

TL;DR

A counterfactual data augmentation method effectively mitigates object hallucinations in the CLIP model, and the enhanced model can serve as a visual encoder that alleviates the object hallucination issue in LVLMs.

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive performance, yet research has pointed out a serious issue with object hallucinations within these models. However, there is no clear conclusion as to which part of the model these hallucinations originate from. In this paper, we present an in-depth investigation into the object hallucination problem specifically within the CLIP model, which serves as the backbone for many state-of-the-art vision-language systems. We show that even in isolation, the CLIP model is prone to object hallucinations, suggesting that the hallucination problem is not solely due to the interaction between the vision and language modalities. To address this, we propose a counterfactual data augmentation method that creates negative samples covering a variety of hallucination issues. We demonstrate that our method effectively mitigates object hallucinations in the CLIP model, and we show that the enhanced model can be employed as a visual encoder, alleviating the object hallucination issue in LVLMs.

Paper Structure (29 sections, 8 equations, 3 figures, 13 tables)

Introduction
Related Work
Large Vision-Language Model
Hallucination in LVLMs
The OHD-Caps Benchmark
Dataset Construction
Inserting Hallucinatory Objects
Removing Existing Objects
Evaluation and Analysis
Models
Results
Analysis
Methodology
Experiments
Training Datasets
...and 14 more sections

Figures (3)

Figure 1: The pipeline of our benchmark creation process. For an image, we first use SEEM (Zou et al., 2023) to identify objects within the image, and we obtain illusory objects that do not exist in the picture through different sampling strategies. We then ask GPT to insert or delete objects in the original sentences to create negative samples. We provide both positive and negative samples to the CLIP model to observe whether the model assigns the highest score to the positive sample. This image is from the NoCaps dataset, and the model is CLIP ViT-B/32.
Figure 2: The performance of the model on the OHD-Caps dataset with different training data volumes provided. We report the average results of three random seeds.
Figure 3: Examples from our benchmark OHD-Caps. The three images in the figure are from the COCO, Flickr, and NoCaps datasets, respectively.
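
The scoring step described in the Figure 1 caption (checking whether the positive caption receives the highest score among the positive and negative captions) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the embeddings are toy 2-D vectors standing in for CLIP image and text features, cosine similarity is assumed as the score, and a tie is assumed to count as a failure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def positive_ranks_first(image_emb, positive_emb, negative_embs):
    """True if the positive caption scores strictly higher than every negative.

    Assumption: a tie counts as a failure (hypothetical choice for this sketch).
    """
    pos_score = cosine(image_emb, positive_emb)
    return all(pos_score > cosine(image_emb, neg) for neg in negative_embs)


def benchmark_accuracy(examples):
    """Fraction of examples where the positive caption ranks first."""
    hits = sum(positive_ranks_first(*example) for example in examples)
    return hits / len(examples)


# Toy data (hypothetical): each example is (image, positive, [negatives]).
examples = [
    # The positive caption is closest to the image: counted as correct.
    ([1.0, 0.0], [0.9, 0.1], [[0.5, 0.5], [0.0, 1.0]]),
    # A negative caption is closer than the positive: counted as a hallucination.
    ([0.0, 1.0], [0.7, 0.7], [[0.1, 0.9]]),
]

accuracy = benchmark_accuracy(examples)
print(f"accuracy = {accuracy:.2f}")
```

In practice the toy vectors would be replaced by image and caption embeddings produced by the CLIP model under test; that substitution is left out here to keep the sketch self-contained.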
