ComCLIP: Training-Free Compositional Image and Text Matching

Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

TL;DR

ComCLIP introduces a training-free, causally grounded approach to compositional image-text matching that disentangles images into subject, object, and predicate subimages and aggregates their embeddings with the global CLIP representation. By applying backdoor-adjustment-inspired interventions and counterfactual subimage generation, it mitigates spurious correlations learned during pretraining and improves zero-shot compositional generalization across multiple benchmarks, including Winoground, VL-checklist, SVO-Probes, and the newly created ComVG dataset. The method is plug-and-play with CLIP-like models and shows competitive gains on general image-text retrieval tasks (Flickr30K, MSCOCO) while delivering notable improvements in compositional tasks. The work also provides extensive ablations and qualitative analyses, demonstrating robustness to different subimage generators and parsing pipelines, and introduces ComVG to benchmark compositional reasoning in vision-language systems.

Abstract

Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching, a harder matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets (SVO, ComVG, Winoground, and VL-checklist) and two general image-text retrieval datasets (Flickr30K and MSCOCO) demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP, SLIP, and BLIP2 without further training or fine-tuning. Our code is available at https://github.com/eric-ai-lab/ComCLIP.

Paper Structure

This paper contains 37 sections, 4 equations, 16 figures, 11 tables, and 1 algorithm.

Figures (16)

  • Figure 1: Examples of the compositional image-text matching problem, in which the positive and negative images have very similar semantics and differ only in the subject, predicate/verb, or object. CLIP mistakenly connects the text prompts with the wrong images on the right (high similarity scores with negative images), while our ComCLIP model does compositional matching more effectively.
  • Figure 2: Overview of our ComCLIP framework using CLIP as the backbone. We disentangle the input image using GRiT (Wu et al., 2022) and a Large Language Model (LLM), following separate rules for encoding the object, subject, and predicate. The figure shows the case where multiple subjects/objects/predicates are involved (this is a positive example from Flickr30K).
  • Figure 3: Overview of our ComCLIP framework using CLIP as the backbone. We disentangle the input image using three independent encoding mechanisms, following separate rules for encoding the object, subject, and predicate. The entity information is introduced into the global embedding of the whole image. Module components from CLIP (vision encoder $F(\cdot)$, text encoder $G(\cdot)$) are always frozen. In the implementation, matching and scoring begin with the input image being processed into object, subject, and predicate sub-images. The original sentence and image, along with their parsed words and sub-images, are then fed into the CLIP text and vision encoders. Next, a cosine similarity score is computed for each pairing of sub-image and word embeddings. These scores are passed through a softmax layer, resulting in three positive weights. The reweighted sub-image embeddings are then added to the embedding of the original image. Finally, the matching score is obtained by comparing this aggregated image embedding with the global text embedding. The whole framework is training-free.
  • Figure 4: Comparison of Recall@1 (%) and Recall@5 (%) using CLIP and ComCLIP over the general image-text retrieval datasets.
  • Figure 5: Examples of the generated subject, object, and predicate subimages. The first and third rows correspond to positive images and the individual outputs of each IM for different entities. The second and fourth rows correspond to negative ones. Top two rows: examples from the ComVG dataset, with (Woman, carrying, skateboard) used as the (subject, predicate, object) input to each IM. Bottom two rows: examples from the SVO-Probes dataset, with (Cat, sits, table) used as input to each IM. Note that for negative images, when an IM cannot accept the given (subject, predicate, object) and generate output subimages, the subimage is replaced with the original image for entity composition.
  • ...and 11 more figures
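
The scoring procedure described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `softmax`, `comclip_score`) are hypothetical, the embeddings are toy 3-dimensional vectors standing in for the outputs of the frozen CLIP encoders, and whether embeddings are normalized before being added is an assumption that the caption does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax; returns positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def comclip_score(image_emb, text_emb, sub_image_embs, word_embs):
    """Hypothetical sketch of the matching score from the Figure 3 caption.

    sub_image_embs and word_embs are aligned lists (e.g. subject, predicate,
    object). Assumption: sub-image embeddings are added without renormalizing.
    """
    # One cosine similarity per (sub-image, word) pair.
    sims = [cosine(s, w) for s, w in zip(sub_image_embs, word_embs)]
    # Softmax turns the similarities into positive weights.
    weights = softmax(sims)
    # Add the reweighted sub-image embeddings to the global image embedding.
    fused = list(image_emb)
    for weight, sub in zip(weights, sub_image_embs):
        for i, value in enumerate(sub):
            fused[i] += weight * value
    # Compare the aggregated image embedding with the global text embedding.
    return cosine(fused, text_emb), weights


# Toy vectors standing in for encoder outputs (illustrative values only).
image = [1.0, 0.0, 0.0]
text = [1.0, 1.0, 0.0]
subs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

score, weights = comclip_score(image, text, subs, words)
baseline = cosine(image, text)
```

In this toy setup the third sub-image does not match its word, so it receives the smallest weight and contributes least to the aggregated embedding. No parameters are learned anywhere, which is consistent with the caption's statement that the framework is training-free.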