COLA: A Benchmark for Compositional Text-to-image Retrieval

Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko

TL;DR

COLA introduces a focused benchmark for evaluating compositional attribute–object binding in text-to-image retrieval and demonstrates that a lightweight multimodal adapter (MM-Adapter) significantly improves compositional reasoning on COLA, outperforming standard fine-tuning and prompt-tuning. The study shows COLA is more challenging than CREPE, highlighting remaining gaps between machine and human performance and the importance of cross-modal adaptation trained on attribute–object data. Across CLIP and FLAVA, multimodal adaptation yields strong gains, suggesting that training multimodal layers with contrastive attribute–object data is key. The work provides a rigorous data and evaluation framework, analyses multiple data configurations, and offers practical baselines for future research in compositional vision–language understanding.

Abstract

Compositional reasoning is a hallmark of human visual intelligence. Yet large vision-language models, despite their size, struggle to represent simple compositions that combine objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve images with the correct configuration of attributes and objects and avoid choosing a distractor image with the same objects and attributes but in the wrong configuration. Cola contains about 1.2k composed queries of 168 objects and 197 attributes on around 30K images. Our human evaluation finds that Cola is 83.33% accurate, similar to contemporary compositionality benchmarks. Using Cola as a testbed, we explore empirical modeling designs to adapt pre-trained vision-language models to reason compositionally. We explore 6 adaptation strategies on 2 seminal vision-language models, using two compositionality-centric test benchmarks: Cola and CREPE. We find the optimal adaptation strategy is to train a multimodal attention layer that jointly attends over the frozen pre-trained image and language features. Surprisingly, training multimodal layers on CLIP performs better than tuning a larger FLAVA model with already pre-trained multimodal layers. Furthermore, our adaptation strategy improves CLIP and FLAVA to comparable levels, suggesting that training multimodal layers using contrastive attribute-object data is key, as opposed to using them pre-trained. Lastly, we show that Cola is harder than a closely related contemporary benchmark, CREPE, since simpler fine-tuning strategies without multimodal layers suffice on CREPE but not on Cola. However, we still see a significant gap between our best adaptation and human accuracy, suggesting considerable room for further research.
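
The retrieval task described in the abstract can be made concrete with a small scoring sketch. It assumes a hypothetical 2x2 similarity matrix `sim[c][i]` between two captions and two images, where image `c` is the correct match for caption `c` and the other image is the distractor with the same objects and attributes in the wrong configuration. The function names and the rule that both captions must prefer their own image are illustrative assumptions, not necessarily the paper's exact metric.

```python
from typing import Sequence


def pair_is_correct(sim: Sequence[Sequence[float]]) -> bool:
    """Return True if each caption scores its own image above the distractor.

    sim[c][i] is a (hypothetical) similarity between caption c and image i,
    with image c assumed to be the correct match for caption c.
    """
    return sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]


def pair_accuracy(sims: Sequence[Sequence[Sequence[float]]]) -> float:
    """Fraction of caption/image pairs for which both matches are correct."""
    if not sims:
        raise ValueError("need at least one example")
    return sum(pair_is_correct(s) for s in sims) / len(sims)


# Toy scores (made up for illustration, not results from the paper).
examples = [
    [[0.9, 0.2], [0.1, 0.8]],  # both captions prefer their own image
    [[0.6, 0.7], [0.3, 0.9]],  # caption 0 prefers the distractor
]
accuracy = pair_accuracy(examples)
```

Requiring both matches to be correct means a model cannot score well by ranking one image highly for every caption; it has to bind each attribute to the right object.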

Paper Structure

This paper contains 21 sections, 21 figures, 5 tables.

Figures (21)

  • Figure 1: We present $\mathcal{C}ola$, where a model has to Compose Objects Localized with Attributes. To solve $\mathcal{C}ola$, a model must match the correct image to the correct caption, not a distractor image with the same objects and attributes but in the wrong configuration. We explore the design space of possible mechanisms to adapt existing models to this task; we show that a simple multimodal adaptation method to finetune pre-trained vision-language representations works best.
  • Figure 2: a) $\mathcal{C}ola$ multi-object setting validation set: a human-cleaned difficult validation set for testing attribute-object binding. The two images have similar objects and attributes but in different configurations. A model must match the correct images to the correct captions. b) The optimal adaptation strategy (MM-Adapter): a lightweight multimodal transformer encoder on top of frozen pre-trained encoders. The multimodal encoder crafts a stronger representation by cross-attending to image patches and text tokens to attach the correct attributes to the correct objects. The stronger representation is then trained to align with the frozen text representation.
  • Figure 3: Qualitative results of multi-object matching (left) and retrieving a single object with multiple attributes (right).
  • Figure 4: Qualitative results. Left: cases where models struggle with multiple object-attribute compositionality. Right: the single-object compositional retrieval cases with the most and the least improvement.
  • Figure 5: MAP by the number of attributes in the query on the CLEVR dataset. MM-Adapter continues to perform well as the number of attributes increases.
  • ...and 16 more figures
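
The Figure 2 caption describes MM-Adapter as a multimodal encoder that attends over image patches and text tokens from frozen encoders. The following is a minimal sketch of that data flow only, assuming a single attention head with no learned projections, residual connections, or training loop; the names `joint_attention`, `patches`, and `words` are hypothetical, and the features are toy values rather than real CLIP or FLAVA outputs.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention weights of every token over all tokens."""
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in tokens])
        for query in tokens
    ]


def joint_attention(image_patches: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    """Attend jointly over concatenated image-patch and text-token features.

    Learned query/key/value projections are omitted (an assumption made for
    brevity), so this shows one attention step, not a trainable layer.
    """
    tokens = image_patches + text_tokens
    dim = len(tokens[0])
    weights = attention_weights(tokens)
    return [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]


# Toy frozen features (made up): 2 image patches and 2 text tokens, dim 3.
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
words = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
fused = joint_attention(patches, words)
```

Because patches and tokens share one sequence, every output row mixes information from both modalities, which is the property the caption relies on to attach attributes to the right objects.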