BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

TL;DR

This work presents the Bidirectional Vision-Language Compositionality (BiVLC) dataset and shows that a contrastive model trained using synthetic images and texts significantly improves over the base model in SugarCrepe and in BiVLC for both retrieval directions.

Abstract

Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work, we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples, ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts significantly improves over the base model in SugarCrepe and in BiVLC for both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at https://imirandam.github.io/BiVLC_project_page.
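
To make the four retrieval examples concrete, the following is a minimal sketch of how one instance could be scored from a 2x2 image-caption similarity matrix. The function name `score_instance`, the matrix layout, and the combined `group` flag are illustrative assumptions, not the authors' evaluation code; the similarity values are made up.

```python
def score_instance(sim):
    """Score one instance from a 2x2 similarity matrix (hypothetical layout).

    sim[i][j] is the similarity between image i and caption j,
    where index 0 is the positive item and index 1 the hard negative.
    """
    # Image-to-text: each image must rank its own caption above the other one.
    i2t = [sim[0][0] > sim[0][1], sim[1][1] > sim[1][0]]
    # Text-to-image: each caption must rank its own image above the other one.
    t2i = [sim[0][0] > sim[1][0], sim[1][1] > sim[0][1]]
    # Assumption: an instance counts as fully solved only if all four are right.
    return {"i2t": i2t, "t2i": t2i, "group": all(i2t) and all(t2i)}


# Made-up similarities: both image-to-text examples are solved,
# but the positive caption is closer to the negative image (0.7) than
# to the positive image (0.6), so one text-to-image example fails.
example = [[0.6, 0.3],
           [0.7, 0.8]]
result = score_instance(example)
```

In this toy case an image-to-text-only benchmark would report the instance as solved, while the text-to-image direction exposes the error, which is the kind of gap the bidirectional setup is designed to surface.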

Paper Structure (59 sections, 3 equations, 10 figures, 7 tables)

This paper contains 59 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Given an image and two captions from SugarCrepe, BiVLC constructs an instance adding a negative image (Img-) generated from the negative caption (Caption-). The instance produces four retrieval examples: two for image-to-text retrieval and two for text-to-image retrieval.
  • Figure 2: Three instances of BiVLC. The bottom row shows the negative captions and the corresponding images we generated. From left to right, the negative captions were created by Replace, Swap and Add.
  • Figure 3: Diagram of dataset construction: Starting from SugarCrepe instances, uniformly format positive and hard negative captions (Step 1), generate hard negative images (Step 2), ask human annotators to choose the best generated image (Step 3), and filter out ambiguous instances (Step 4). As a result, we get BiVLC instances, each consisting of 2 captions and 2 images.
  • Figure 4: When we train only with hard negative texts, the distance between the positive caption (Caption+) and the negative image (Image-) may be even smaller than the distance between the positive caption and the positive image (Image+) (left). When we add hard negative images, we force the distance between the positive caption and the negative image to increase, while minimizing the distance between the positive caption and the positive image (right).
  • Figure 5: Example of a BiVLC instance after loading the dataset.
  • ...and 5 more figures
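
The intuition in the Figure 4 caption can be illustrated with a small numerical sketch. It assumes a standard softmax cross-entropy (InfoNCE-style) objective over cosine similarities, with the positive caption choosing between the positive and the hard negative image. The function names, the temperature value, and the toy 2-D embeddings are all illustrative assumptions; this is not the paper's training code or its exact loss.

```python
import numpy as np


def normalize(v):
    """Return v scaled to unit length so dot products are cosine similarities."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def text_to_image_loss(caption_pos, image_pos, image_neg, temperature=0.07):
    """Cross-entropy of picking Image+ over Image- given Caption+ (a sketch)."""
    c = normalize(caption_pos)
    candidates = np.stack([normalize(image_pos), normalize(image_neg)])
    logits = candidates @ c / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])  # index 0 is the positive image


# Toy embeddings (made up): the caption points along the first axis.
caption = [1.0, 0.0]
image_pos = [0.9, 0.1]
near_negative = [0.95, 0.05]  # negative image closer to the caption than Image+
far_negative = [0.1, 0.9]     # negative image pushed away from the caption

loss_near = text_to_image_loss(caption, image_pos, near_negative)
loss_far = text_to_image_loss(caption, image_pos, far_negative)
```

Under these assumptions the loss is high when the negative image sits closer to the positive caption than the positive image does, and it drops toward zero once the negative image is pushed away, which matches the left and right panels described in the caption.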