CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

TL;DR

CREPE targets a long-standing gap in vision-language research by proposing a large-scale, compositionality-focused benchmark. It formalizes a scene-graph-based language to assess two facets, systematicity and productivity, via retrieval tasks with hard negatives, with test splits tailored to three large pretraining datasets. The experiments reveal consistent drops in compositional generalization when captions contain compounds unseen during training and as caption complexity increases, and neither larger models nor larger training datasets guarantee improvements. By providing scalable datasets and a rigorous evaluation protocol, CREPE offers a concrete path to measuring and ultimately improving compositional reasoning in vision-language foundation models.

Abstract

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs with nine different complexities plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
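The Recall@1 metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes each retrieval set holds one ground-truth caption plus several hard negatives, all already scored against the image by some model. The function names, the convention that the ground truth sits at index 0, the tie-handling rule, and the toy scores are all illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import Sequence


def recall_at_1(score_rows: Sequence[Sequence[float]], gt_index: int = 0) -> float:
    """Fraction of retrieval sets whose ground-truth caption scores strictly highest.

    Assumption: each row holds image-text similarity scores for one image,
    with the ground-truth caption at position `gt_index` and hard negatives
    elsewhere. Ties are counted as misses (an illustrative choice).
    """
    if not score_rows:
        raise ValueError("need at least one retrieval set")
    hits = 0
    for scores in score_rows:
        gt_score = scores[gt_index]
        # A hit requires the ground truth to beat every hard negative.
        if all(gt_score > s for i, s in enumerate(scores) if i != gt_index):
            hits += 1
    return hits / len(score_rows)


def chance_level(num_candidates: int) -> float:
    """Expected Recall@1 of a uniformly random guess over the candidates."""
    return 1.0 / num_candidates


# Hypothetical scores: ground truth first, then two hard negatives.
toy_scores = [
    [0.31, 0.28, 0.12],  # ground truth ranked first: hit
    [0.22, 0.25, 0.10],  # a hard negative wins: miss
    [0.40, 0.15, 0.39],  # ground truth narrowly wins: hit
    [0.18, 0.18, 0.05],  # tie with a hard negative: miss
]

r1 = recall_at_1(toy_scores)
baseline = chance_level(len(toy_scores[0]))
print(f"Recall@1 = {r1:.2f}, chance = {baseline:.2f}")
```

Comparing the measured value against `chance_level` is what gives meaning to the statement that retrieval success nears random chance at high complexity: with more candidates per set, the chance baseline drops accordingly.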

Paper Structure

This paper contains 44 sections, 13 figures, and 17 tables.

Figures (13)

  • Figure 1: We introduce CREPE, a benchmark to evaluate whether vision-language foundation models demonstrate two fundamental aspects of compositionality: systematicity and productivity. To evaluate systematicity, CREPE utilizes Visual Genome and introduces three new test datasets for the three popular pretraining datasets: CC-12M, YFCC-15M, and LAION-400M. These enable evaluating models' abilities to systematically generalize their understanding to seen compounds, unseen compounds, and even unseen atoms. To evaluate productivity, CREPE introduces examples of nine complexities, with three types of hard negatives for each.
  • Figure 2: An overview of the systematicity retrieval set generation process. First, a model's image-caption training set is parsed to identify which atoms and compounds the model has seen. Then, an evaluation set is divided into three compositional splits according to whether the model has seen all the compounds of a caption (Seen Compounds), all of its atoms but not all of its compounds (Unseen Compounds), or not all of its atoms (Unseen Atoms). Finally, hard negative captions $\textsc{HN-Atom}$ and $\textsc{HN-Comp}$ are generated for the hard negatives retrieval set $\mathcal{D}_{test}^{HN}$.
  • Figure 3: An overview of the productivity retrieval set generation process. By performing random walks on the scene graphs of an evaluation dataset, we generate subgraphs of various complexities. Then, for complexities $n \in \{4, 5, \dots, 12\}$ and three hard negative types, we populate the retrieval set $\mathcal{D}_{test}^{HN}$ by generating a ground truth caption for each $n$-subgraph and hard negatives for each caption.
  • Figure 4: Systematicity analysis. We plot models' Recall@1 on the Seen Compounds (SC) vs. Unseen Compounds (UC) split of the systematicity retrieval set with hard negatives HN-Atom, HN-Comp, and both types. We observe a consistent drop in models' performance from the SC to the UC split when the retrieval set contains HN-Atom or both types, and little to no difference when it contains only HN-Comp.
  • Figure 5: Productivity analysis. We plot models' Recall@1 on the hard negatives retrieval set against complexity, averaged across all models pretrained on all three training datasets. We find that models' ability to retrieve the ground-truth degrades as complexity increases.
  • ...and 8 more figures
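
The split assignment described in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes atoms are plain strings and compounds are tuples of atoms, and that the sets of seen atoms and seen compounds have already been extracted from a training set. The function name `assign_split`, the data representation, and the example vocabulary are hypothetical; the paper's actual parsing pipeline is not reproduced here.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: an atom is a string (object, attribute, or
# relationship name); a compound is a tuple of atoms.
Compound = Tuple[str, ...]


def assign_split(
    atoms: Iterable[str],
    compounds: Iterable[Compound],
    seen_atoms: Set[str],
    seen_compounds: Set[Compound],
) -> str:
    """Assign one evaluation caption to a compositional split."""
    # Any atom missing from the training set puts the caption in Unseen Atoms.
    if any(atom not in seen_atoms for atom in atoms):
        return "Unseen Atoms"
    # All atoms are seen; check whether every compound was seen as well.
    if all(compound in seen_compounds for compound in compounds):
        return "Seen Compounds"
    # All atoms seen, but at least one compound is novel.
    return "Unseen Compounds"


# Hypothetical vocabulary parsed from a training set.
seen_atoms = {"dog", "frisbee", "catching", "pink"}
seen_compounds = {("dog", "catching", "frisbee"), ("pink", "frisbee")}

examples = {
    "dog catching frisbee": (
        ["dog", "catching", "frisbee"],
        [("dog", "catching", "frisbee")],
    ),
    "pink dog": (["pink", "dog"], [("pink", "dog")]),
    "purple dog": (["purple", "dog"], [("purple", "dog")]),
}

for caption, (atoms, compounds) in examples.items():
    split = assign_split(atoms, compounds, seen_atoms, seen_compounds)
    print(f"{caption!r} -> {split}")
```

Checking atoms before compounds keeps the three splits mutually exclusive: a caption with a novel atom never lands in Unseen Compounds, even though any compound containing that atom is necessarily unseen too.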