
Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning

Hanyao Wang, Yibing Zhan, Liu Liu, Liang Ding, Yan Yang, Jun Yu

TL;DR

A Balanced Score with Auxiliary Prompts (BSAP) is proposed to mitigate CLIP's text-to-image retrieval hallucination under zero-shot learning. The strategy is also validated on other pretrained cross-modal models, such as ALBEF and BLIP.

Abstract

Pretrained cross-modal models, of which CLIP is the most representative, have recently led to a boom in using pretrained models for cross-modal zero-shot tasks, owing to their generalization properties. However, we analytically discover that CLIP suffers from text-to-image retrieval hallucination, which limits its capabilities under zero-shot learning: when asked which of several candidate images matches a given query text, CLIP simply selects the image with the highest similarity score, even though it can recognize the content of each image. Accordingly, we propose a Balanced Score with Auxiliary Prompts (BSAP) to mitigate CLIP's text-to-image retrieval hallucination under zero-shot learning. Specifically, we first design auxiliary prompts to provide multiple reference outcomes for every single image retrieval. The outcomes derived from each retrieved image, in conjunction with the target text, are then normalized to obtain the final similarity, which alleviates hallucinations in the model. Additionally, we can merge CLIP's original results with BSAP to obtain a more robust hybrid outcome (BSAP-H). Extensive experiments on two typical zero-shot learning tasks, i.e., Referring Expression Comprehension (REC) and Referring Image Segmentation (RIS), demonstrate the effectiveness of BSAP. Specifically, when evaluated on the validation set of RefCOCO in REC, BSAP increases CLIP's performance by 20.6%. Further, we validate that our strategy can be applied to other types of pretrained cross-modal models, such as ALBEF and BLIP.
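
The balancing idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the similarity values are invented, the per-image normalization is taken to be a softmax over the target text and the auxiliary prompts, the temperature of 0.01 is a hypothetical choice, and the hybrid score is taken to be a simple weighted average with a hypothetical weight `alpha`. The function names `balanced_scores` and `hybrid_scores` are likewise hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def balanced_scores(sim, temperature=0.01):
    """Normalize each image's scores across prompts (assumed form of BSAP).

    sim has shape (n_images, 1 + n_aux): column 0 holds each image's raw
    similarity to the target text, the remaining columns hold its
    similarity to the auxiliary prompts.
    Returns, per image, the share of probability mass on the target text.
    """
    probs = softmax(sim / temperature, axis=1)
    return probs[:, 0]


def hybrid_scores(sim, alpha=0.5, temperature=0.01):
    """Blend original and balanced scores (assumed form of BSAP-H)."""
    # Original behaviour: compare raw target-text scores across images.
    original = softmax(sim[:, 0] / temperature, axis=0)
    # Balanced behaviour, rescaled so both terms sum to 1 over images.
    balanced = balanced_scores(sim, temperature)
    balanced = balanced / balanced.sum()
    return alpha * original + (1.0 - alpha) * balanced


# Invented example: query text "person", auxiliary prompts "dog" and "cat".
# Row 0 is a dog image whose scores are high across the board;
# row 1 is a person image whose scores are lower across the board.
sim = np.array([
    [0.26, 0.32, 0.25],  # dog image:    person, dog, cat
    [0.24, 0.18, 0.17],  # person image: person, dog, cat
])

raw_choice = int(np.argmax(sim[:, 0]))               # picks the dog image
balanced_choice = int(np.argmax(balanced_scores(sim)))  # picks the person image
hybrid_choice = int(np.argmax(hybrid_scores(sim)))

print("raw:", raw_choice, "balanced:", balanced_choice, "hybrid:", hybrid_choice)
```

In this toy case the dog image wins on the raw score only because all of its scores sit in a higher range, while its own best-matching prompt is "dog". Normalizing within each image removes that range effect, which is the intuition the abstract attributes to BSAP.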

Paper Structure

This paper contains 33 sections, 8 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: (a) The illustration of the text-to-image retrieval hallucination of CLIP (ViT-B/32). The text-to-image retrieval hallucination is caused by the range of similarity scores. (b) Performance of the proposed BSAP. When BSAP is employed, the hallucination in the text-to-image retrieval task is resolved.
  • Figure 2: Examples and analyses of hallucination in CLIP on text-to-image retrieval tasks. The blue bounding boxes represent the Ground Truth (GT), while the red bounding boxes depict the predicted results from ReCLIP.
  • Figure 3: Similarity scores generated by the CLIP model for 100 dog images and 100 human images. In the graphs, blue dots represent the results for dog images and red dots represent the results for human images. (a) Similarity scores for the text caption "person". The results for dog images are sorted in descending order and the results for human images in ascending order. (b) Similarity scores for the text caption "dog". The results for dog images are sorted in ascending order and the results for human images in descending order.
  • Figure 4: The process of BSAP to get balanced results with auxiliary prompts.
  • Figure 5: Applying BSAP to ReCLIP. Blue sections represent the original core of ReCLIP. Orange sections denote the components that BSAP and BSAP-H integrate into ReCLIP. Red boxes highlight the original scores and our BSAP scores.
  • ...and 4 more figures