
SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju

TL;DR

This paper tackles zero-shot compositional image retrieval (CIR) by proposing SCOT, a self-supervised pretraining framework that learns to compose a reference image with a modification text while using an LLM to generate text-based supervision signals. It leverages large-scale contrastive vision–language encoders (e.g., CLIP/BLIP-2) and a trainable Combiner for fusion, with a loss that brings the composed image embedding $\mathcal{V}_c$ close to the modified caption embedding $\mathcal{T}_u$ and pushes it away from negatives, including hard negatives and original captions. By generating 290K text triplets from captions and optimizing on a contrastive objective, SCOT achieves state-of-the-art zero-shot results on FashionIQ and CIRR and demonstrates strong open-world generalization while reducing the need for domain-specific triplet annotations. The approach also provides insights into the impact of backbones, dataset size, and supervision type, showing that text supervision can outperform retrieved-image targets and that better backbones yield larger gains. Overall, SCOT offers a scalable, annotation-free pathway to effective zero-shot CIR with practical implications for open-world retrieval tasks.

Abstract

Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
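The training signal described above can be illustrated with a minimal sketch. It assumes an InfoNCE-style softmax loss in which each composed embedding is scored against every modified-caption embedding in the batch plus its own original-caption embedding. The function names, the temperature value, and the exact form of the loss are illustrative assumptions, not details taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(composed, modified, original, temperature=0.07):
    """Mean softmax cross-entropy over a batch (hypothetical loss form).

    composed[i]: composed embedding V_c for sample i
    modified[i]: modified-caption embedding T_u for sample i (positive)
    original[i]: original-caption embedding T for sample i (hard negative)
    The temperature of 0.07 is an assumed value.
    """
    composed = [normalize(v) for v in composed]
    modified = [normalize(t) for t in modified]
    original = [normalize(t) for t in original]

    total = 0.0
    for i, v_c in enumerate(composed):
        # Candidates: all modified captions in the batch (index i is the
        # positive, the rest are in-batch negatives) ...
        logits = [dot(v_c, t_u) / temperature for t_u in modified]
        # ... plus this sample's original caption as an extra negative.
        logits.append(dot(v_c, original[i]) / temperature)

        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(composed)


if __name__ == "__main__":
    modified = [[1.0, 0.0], [0.0, 1.0]]
    original = [[0.0, 1.0], [1.0, 0.0]]
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], modified, original)
    misaligned = contrastive_loss([[0.0, 1.0], [1.0, 0.0]], modified, original)
    print(aligned < misaligned)
```

In this toy batch the loss is near zero when each composed embedding points at its modified caption, and large when it points at the original caption instead, which is the behavior the text supervision is meant to encourage.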

Paper Structure (13 sections, 7 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Compositional image retrieval methods typically require domain-specific image-text-image triplets for training and cannot generalize to unseen domains. In contrast, SCOT uses existing large noisy captioned image datasets for compositional training and demonstrates zero-shot generalizability to new domains.
  • Figure 2: SCOT pretraining and inference. Left: The composition function $f_c$ is trained using existing image-caption datasets, a frozen image-text encoder (such as CLIP), and a frozen large language model (LLM). The LLM generates the modification text $m$ and a modified caption $u$. The reference image embedding $\mathcal{V}$ and the modification text embedding $\mathcal{T}_m$ are passed to $f_c$ to get the composed embedding $\mathcal{V}_c$. We optimize the parameters of $f_c$ to draw $\mathcal{V}_c$ towards the modified caption embedding $\mathcal{T}_u$ and away from the original caption embedding $\mathcal{T}$. The full loss also pushes $\mathcal{V}_c$ away from the embeddings of other (non-matching) modified captions within each batch (not illustrated here). Right: During inference, we compute the similarity between the composed embedding and the embeddings of gallery images to retrieve the target image.
  • Figure 3: LLM-generated text triplet samples, showing appropriate modifications over different image domains.
  • Figure 4: Qualitative retrieval results on validation sets. Top: FashionIQ. Bottom: CIRR. A green box indicates the correctly retrieved image. For CIRR, the rightmost column illustrates the corresponding modality weight learned by SCOT for that example. (Best viewed in color.)
  • Figure 5: Gains relative to Text-Only. Difference in recall ($\mathrm{\Delta}$R) between methods and the backbone-matched Text-Only baseline.
  • ...and 2 more figures
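
The inference step in the Figure 2 caption and the recall metric referenced in Figure 5 can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the ranking score and Recall@K as the fraction of queries whose target appears in the top K results. The helper names `retrieve` and `recall_at_k` are hypothetical, and the embeddings are toy values rather than encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve(composed, gallery, k):
    """Return indices of the k gallery embeddings most similar to `composed`."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda idx: cosine(composed, gallery[idx]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(queries, targets, gallery, k):
    """Fraction of queries whose target index is among the top-k results."""
    hits = sum(
        1 for query, target in zip(queries, targets)
        if target in retrieve(query, gallery, k)
    )
    return hits / len(queries)


if __name__ == "__main__":
    # Toy gallery of three image embeddings (illustrative values only).
    gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    # Toy composed query embeddings and the gallery index of each target.
    queries = [[0.9, 0.1], [0.1, 0.9]]
    targets = [0, 2]
    print(retrieve(queries[0], gallery, 2))
    print(recall_at_k(queries, targets, gallery, 1))
    print(recall_at_k(queries, targets, gallery, 2))
```

A difference in recall between two methods, as plotted in Figure 5, would then be the difference between two such `recall_at_k` values computed on the same queries and gallery.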