SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval
Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju
TL;DR
This paper tackles zero-shot compositional image retrieval (CIR) with SCOT, a self-supervised pretraining framework that learns to compose a reference image with a modification text, using an LLM to generate the text-based supervision. It builds on large-scale contrastive vision-language encoders (e.g., CLIP, BLIP-2) and a trainable Combiner for fusion. The loss pulls the composed image embedding $\mathcal{V}_c$ toward the modified caption embedding $\mathcal{T}_u$ and pushes it away from negatives, including hard negatives and the original captions. Trained with this contrastive objective on 290K text triplets generated from captions, SCOT achieves state-of-the-art zero-shot results on FashionIQ and CIRR and generalizes well to open-world settings, while reducing the need for domain-specific triplet annotations. The analysis of backbones, dataset size, and supervision type shows that text supervision can outperform retrieved-image targets and that better backbones yield larger gains. Overall, SCOT offers a scalable, annotation-free route to zero-shot CIR, with practical implications for open-world retrieval.
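The contrastive objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `scot_style_loss`, the InfoNCE-style softmax form, the temperature value, and the way negatives are laid out (other modified captions in the batch plus all original captions) are assumptions made for illustration. Only the roles of $\mathcal{V}_c$ (composed embedding) and $\mathcal{T}_u$ (modified caption embedding) come from the summary.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def scot_style_loss(v_c, t_u, t_orig, temperature=0.07):
    """InfoNCE-style loss (assumed form, hypothetical name).

    v_c:    (B, D) composed image+text embeddings, one per query
    t_u:    (B, D) modified-caption embeddings; row i is the positive for query i
    t_orig: (B, D) original-caption embeddings; all treated as extra negatives
    """
    v_c, t_u, t_orig = (l2_normalize(a) for a in (v_c, t_u, t_orig))
    # (B, 2B) logits: the first B columns are modified captions (positive on
    # the diagonal), the last B columns are original captions (all negatives).
    logits = np.concatenate([v_c @ t_u.T, v_c @ t_orig.T], axis=1) / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(v_c.shape[0])
    # Cross-entropy against the diagonal (positive) entries, averaged over the batch.
    return float(-log_probs[idx, idx].mean())
```

In this sketch, including the original captions as negatives penalizes a composition that ignores the modification text and stays close to the reference caption.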
Abstract
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor-intensive; and (ii) models generalize poorly to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses state-of-the-art zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
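The retrieval step of the CIR task can be sketched as follows, assuming the image and text embeddings are already available from a vision-language encoder. `toy_combiner` is a hypothetical stand-in (a normalized sum) for a trained composition network, and `retrieve` is an illustrative helper; neither is taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def toy_combiner(image_emb, text_emb):
    """Hypothetical stand-in for a trained composition network: a normalized sum."""
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(text_emb))

def retrieve(query_emb, gallery_embs, k=3):
    """Return indices of the k gallery embeddings most similar to the query."""
    scores = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    # Sort by descending cosine similarity and keep the top k.
    return np.argsort(-scores)[:k]

# Toy usage: the gallery item that mixes both query directions ranks first.
image_emb = np.array([1.0, 0.0, 0.0])   # reference image (assumed embedding)
text_emb = np.array([0.0, 1.0, 0.0])    # modification text (assumed embedding)
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
top = retrieve(toy_combiner(image_emb, text_emb), gallery, k=2)
```

A trained composition network would replace the normalized sum; the ranking step by cosine similarity against gallery image embeddings stays the same in this sketch.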
