CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee

TL;DR

CounterCurate targets two under-explored facets of visio-linguistic reasoning: physically grounded compositional reasoning (e.g., counting and spatial relations) and semantic counterfactual fine-tuning using powerful generators. By constructing grounded negative examples with Flickr30k Entities, GLIGEN, and simple augmentations, and by leveraging GPT-4V and DALLE-3 to produce challenging semantic counterfactuals, CounterCurate yields substantial improvements for both CLIP and LLaVA on position, counting, and SugarCrepe benchmarks. The ablations confirm that combining negative images, negative captions, and grouping is essential, and the approach preserves zero-shot capabilities while enhancing specialized reasoning tasks. The authors release code, datasets, benchmarks, and checkpoints to enable replication and further research in multimodal compositional reasoning.

Abstract

We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability of both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using the grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V. To facilitate future research, we release our code, dataset, benchmark, and checkpoints at https://countercurate.github.io.

Paper Structure

This paper contains 36 sections, 2 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Representative examples of GPT-4V failure cases. In both questions, GPT-4V correctly identifies all objects in question, but chooses the wrong answer because it fails to distinguish between either left and right (the left question) or up and down (the right question).
  • Figure 2: The data curation pipeline of CounterCurate. Given a positive image-caption pair, we first generate the negative captions, based on which we curate the negative images using the most suitable approach. Specifically, (a) for Flickr30k-Positions (left/right), we flip the positional keyword in the caption and then flip the image horizontally; (b) for Flickr30k-Counting, we change the digit in the caption and then apply grounded image inpainting with GLIGEN (Li et al., 2023) to produce the negative image; (c) for Flickr30k-Attributes, we first use GPT-4V (OpenAI, 2023) to generate reasonable hard negative captions by replacing the noun, replacing the adjective, or swapping the adjectives. We then use DALLE-3 (OpenAI, 2023) to generate coherent images.
  • Figure 3: Fine-tuning different types of large multimodal models with CounterCurate. Our pipeline can enhance both contrastive learning models and generative models by augmenting vanilla image-caption pairs with curated negative images and captions. Specifically, our counterfactual image-caption pairs (a) provide auxiliary contrastive loss for models like CLIP, where positive contrastive units in the similarity matrix are colored as blue/red and negative ones are colored as white, and (b) can be naturally integrated into the original next-token prediction loss in text generation models such as LLaVA.
  • Figure 4: To generate the correct above-below negative image via GLIGEN with the original prompt "the ball is below the sports outfit", we recenter the bounding boxes of "ball" and "sports outfit" and feed them into GLIGEN together with an expanded prompt from GPT-4.
  • Figure 5: Qualitative examples of models' compositional reasoning capabilities before/after being fine-tuned via our approach CounterCurate. Wrong answers are marked in red. Our approach enhances both CLIP's and LLaVA's reasoning capabilities.
  • ...and 1 more figure
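
The left/right branch of the pipeline in Figure 2(a) can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the regular expression, and the list-of-rows image representation are all assumptions made for illustration. The idea, as read from the caption, is that swapping the positional keyword gives the negative caption and mirroring the image gives the negative image, so the flipped image matches the flipped caption while each mismatched image-caption combination serves as a hard negative.

```python
import re

# Hypothetical keyword map; the source only says the positional keyword is flipped.
_SWAP = {"left": "right", "right": "left"}
_PATTERN = re.compile(r"\b(left|right)\b", flags=re.IGNORECASE)


def flip_caption(caption: str) -> str:
    """Swap every standalone 'left'/'right', keeping initial capitalization."""
    def _swap(match: re.Match) -> str:
        word = match.group(0)
        flipped = _SWAP[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    return _PATTERN.sub(_swap, caption)


def flip_image(pixels: list) -> list:
    """Mirror an image horizontally; the image is a list of pixel rows."""
    return [list(reversed(row)) for row in pixels]


def flip_box(box: tuple, image_width: int) -> tuple:
    """Mirror an (x_min, y_min, x_max, y_max) box across the vertical center line."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def make_negatives(pixels: list, caption: str) -> tuple:
    """Return (negative_caption, negative_image) for one positive pair."""
    return flip_caption(caption), flip_image(pixels)
```

With a real image library the mirroring step would be a single horizontal-flip call; the pure-Python version above only keeps the sketch free of dependencies. The word-boundary pattern avoids rewriting words such as "leftover", although a production pipeline would also need to handle phrases where "left" or "right" is not a positional keyword (for example "turn right" or "the right answer"), which this sketch does not attempt.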