SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Bhavin Jawade; Joao V. B. Soares; Kapil Thadani; Deen Dayal Mohan; Amir Erfan Eshratifar; Benjamin Culpepper; Paloma de Juan; Srirangaraj Setlur; Venu Govindaraju

TL;DR

This paper tackles zero-shot compositional image retrieval (CIR) by proposing SCOT, a self-supervised pretraining framework that learns to compose a reference image with a modification text, using an LLM to generate text-based supervision signals. It builds on large-scale contrastive vision–language encoders (e.g., CLIP/BLIP-2) and a trainable Combiner for fusion, with a loss that brings the composed image embedding $\mathcal{V}_c$ close to the modified caption embedding $\mathcal{T}_u$ and pushes it away from negatives, including hard negatives and original captions. By generating 290K text triplets from captions and optimizing a contrastive objective, SCOT achieves state-of-the-art zero-shot results on FashionIQ and CIRR and demonstrates strong open-world generalization while reducing the need for domain-specific triplet annotations. The paper also analyzes the impact of backbones, dataset size, and supervision type, showing that text supervision can outperform retrieved-image targets and that better backbones yield larger gains. Overall, SCOT offers a scalable, annotation-free pathway to effective zero-shot CIR with practical implications for open-world retrieval tasks.

Abstract

Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
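The objective summarized in the TL;DR (pull the composed embedding $\mathcal{V}_c$ toward the modified caption embedding $\mathcal{T}_u$, push it away from the other modified captions in the batch and from the original caption embedding $\mathcal{T}$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `scot_style_loss`, the InfoNCE-style softmax form, the single original-caption negative per sample, and the temperature value are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def scot_style_loss(v_c, t_u, t_orig, temperature=0.07):
    """Illustrative contrastive loss (assumed form, not taken from the paper).

    v_c:    (B, D) composed embeddings, i.e. outputs of the composition function
    t_u:    (B, D) modified-caption embeddings (row i is the positive for v_c[i])
    t_orig: (B, D) original-caption embeddings (row i is a hard negative for v_c[i])
    temperature: assumed value, chosen only for this sketch
    """
    v_c, t_u, t_orig = (l2_normalize(a) for a in (v_c, t_u, t_orig))

    # (B, B): similarity of each composed embedding to every modified caption.
    # The diagonal holds the positives; off-diagonals are in-batch negatives.
    logits_batch = v_c @ t_u.T

    # (B, 1): similarity of each composed embedding to its own original caption.
    logits_orig = np.sum(v_c * t_orig, axis=-1, keepdims=True)

    logits = np.concatenate([logits_batch, logits_orig], axis=1) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the positive (diagonal) entries.
    idx = np.arange(v_c.shape[0])
    return float(-log_prob[idx, idx].mean())
```

In this sketch the loss is small when each composed embedding is closest to its own modified caption and large when it stays close to the original caption, which is the behavior the summary describes.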

Paper Structure (13 sections, 7 equations, 7 figures, 3 tables)


Introduction
Related Work
Method
Large-Scale Contrastive Pretraining
Self-Supervised Compositional Pretraining
Training Objective
Inference
Experiments
Datasets
Implementation Details
Comparison with state-of-the-art methods
Discussion
Conclusion

Figures (7)

Figure 1: Compositional image retrieval methods typically require domain-specific image-text-image triplets for training and cannot generalize to unseen domains. In contrast, SCOT uses existing large noisy captioned image datasets for compositional training and demonstrates zero-shot generalizability to new domains.
Figure 2: SCOT pretraining and inference. Left: The composition function $f_c$ is trained using existing image-caption datasets, a frozen image-text encoder (such as CLIP), and a frozen large language model (LLM). The LLM generates the modification text $m$ and a modified caption $u$. The reference image embedding $\mathcal{V}$ and the modification text embedding $\mathcal{T}_m$ are passed to $f_c$ to get the composed embedding $\mathcal{V}_c$. We optimize the parameters of $f_c$ to draw $\mathcal{V}_c$ towards the modified caption $\mathcal{T}_u$ and away from the original caption $\mathcal{T}$. The full loss also pushes $\mathcal{V}_c$ away from the embeddings of other (non-matching) modified captions within each batch (not illustrated here). Right: During inference, we compute the similarity between the composed embedding and the embeddings of gallery images to retrieve the target image.
Figure 3: LLM-generated text triplet samples, showing appropriate modifications over different image domains.
Figure 4: Qualitative retrieval results on validation sets. Top: FashionIQ. Bottom: CIRR. A green box indicates the correctly retrieved image. For CIRR, the rightmost column illustrates the corresponding modality weight learned by SCOT for that example. (Best viewed in color.)
Figure 5: Gains relative to Text-Only. Difference in recall ($\mathrm{\Delta}$R) between methods and the backbone-matched Text-Only baseline.
...and 2 more figures
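
The inference step described in the Figure 2 caption (score gallery image embeddings against the composed embedding and return the best matches) can be sketched as follows. The helper name `retrieve_top_k` is hypothetical, and the use of cosine similarity is an assumption; the caption only says that a similarity is computed.

```python
import numpy as np


def retrieve_top_k(composed, gallery, k=5):
    """Rank gallery images by similarity to one composed query embedding.

    composed: (D,) composed embedding for a (reference image, modification text) query
    gallery:  (N, D) image embeddings of the candidate gallery
    Returns the indices of the top-k gallery items and their scores.
    Cosine similarity is assumed here for illustration.
    """
    composed = composed / np.linalg.norm(composed)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)

    scores = gallery @ composed          # (N,) cosine similarities
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]
```

A call such as `retrieve_top_k(query_embedding, gallery_embeddings, k=10)` would return the ten highest-scoring gallery indices, which is the form of output needed to compute recall-at-k on a benchmark.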
