The Hard Positive Truth about Vision-Language Compositionality

Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna

TL;DR

This work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.

Abstract

Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated -- because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.

Paper Structure (29 sections, 2 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Prior work shows that CLIP is insensitive to minor changes to the input caption, incorrectly assigning a higher score to a hard negative caption $c_n$ than to the original caption $c$. While hard negative finetuning (here, Doveh et al., 2023) fixes the ordering between the original caption and the hard negative, we reveal that the resulting model becomes oversensitive and incorrectly assigns a lower score to a hard positive caption $c_p$. We mitigate this by finetuning with both hard negatives and hard positives, leading to an overall correct understanding of the different captions, and achieving a more well-rounded sense of compositionality (real example shown).
  • Figure 2: Our REPLACE and SWAP evaluation sets. REPLACE replaces either an attribute or a relation in the original caption $c$ to obtain $c_n$ and $c_p$. SWAP swaps object-attribute associations in the original caption $c$ to obtain $c_n$ and $c_p$.
  • Figure 3: Sample predictions of CLIP, a hard negative finetuned model (Doveh et al., 2023), and our model. Top: Considering hard negatives alone provides an incomplete picture of compositionality. Bottom: Hard negative finetuning can harm model performance. Both: Hard negative finetuning incorrectly lowers scores of the original caption, unlike our model.
  • Figure 4: Random samples of REPLACE and SWAP. The first two REPLACE samples are from Relations, and the third from Attributes.
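
The score orderings described in the figure captions above can be illustrated with a small sketch. It assumes that a benchmark example counts as correct under the usual hard-negative test when the original caption $c$ outscores the hard negative $c_n$, and under the stricter hard-positive test when the hard positive $c_p$ also outscores $c_n$. The `Scores` class, the function names, and the numeric scores are hypothetical and for illustration only; they are not taken from the paper or from any CLIP library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Scores:
    """Image-text similarity scores for one image (hypothetical values)."""
    original: float       # score of the original caption c
    hard_negative: float  # score of the hard negative caption c_n
    hard_positive: float  # score of the hard positive caption c_p


def original_correct(s: Scores) -> bool:
    # Hard-negative-only test: c must outscore c_n.
    return s.original > s.hard_negative


def augmented_correct(s: Scores) -> bool:
    # Assumed stricter test: both c and c_p must outscore c_n.
    return s.original > s.hard_negative and s.hard_positive > s.hard_negative


def accuracy(examples: Sequence[Scores], criterion: Callable[[Scores], bool]) -> float:
    if not examples:
        raise ValueError("need at least one example")
    return sum(criterion(s) for s in examples) / len(examples)


# Made-up scores covering the three behaviours described in Figure 1.
examples = [
    Scores(original=0.31, hard_negative=0.25, hard_positive=0.29),  # correct ordering
    Scores(original=0.31, hard_negative=0.25, hard_positive=0.20),  # oversensitive to c_p
    Scores(original=0.22, hard_negative=0.27, hard_positive=0.24),  # insensitive to c_n
]

original_acc = accuracy(examples, original_correct)
augmented_acc = accuracy(examples, augmented_correct)
print(f"original: {original_acc:.3f}, augmented: {augmented_acc:.3f}")
```

On these made-up scores, the second example passes the hard-negative-only test but fails once the hard positive is included, which is the kind of gap the abstract reports when hard positives are added to the evaluation.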