Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval
Junyang Chen, Hanjiang Lai
TL;DR
This work tackles zero-shot composed image retrieval (ZS-CIR) by addressing the gap between pre-trained vision-language models and CIR tasks. It introduces a simple, self-supervised masked tuning strategy that converts image-text pairs into CIR-like triplets $\langle I^m, T, I \rangle$ by masking image patches, training the model to learn text-guided modifications. The approach yields substantial improvements across FashionIQ, CIRR, CIRCO, and GeneCIS, using both CLIP and BLIP backbones, and remains effective when backbones are frozen or extended with lightweight combiners. Because it needs only image-text pairs, masked tuning reduces reliance on costly triplet annotations, which makes zero-shot CIR more practical to scale.
Abstract
Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has attracted growing attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, there is a substantial discrepancy between pre-trained vision-language models and CIR tasks: the vision-language models focus on learning similarities, whereas CIR aims to learn the modifications of an image guided by text. In this paper, we introduce a novel masked tuning approach for pre-trained models that requires no triplet labels and reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, we reformulate the contrastive learning of the vision-language model as the CIR task, where we randomly mask input image patches to generate a $\langle$masked image, text, image$\rangle$ triplet from an image-text pair. Then, we propose a simple but novel masked tuning method, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, the proposed masked tuning learns to better capture fine-grained text-guided modifications. Extensive experimental results demonstrate that our approach significantly outperforms the baseline models on four ZS-CIR datasets: FashionIQ, CIRR, CIRCO, and GeneCIS. Our code is available at https://github.com/Chen-Junyang-cn/PLI
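
The triplet construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mask_patches` and `make_triplet`, the patch size, the mask ratio, and the zero fill value are all assumptions chosen for illustration, and the image is a plain nested list rather than a tensor. It only shows how an image-text pair could be turned into a $\langle$masked image, text, image$\rangle$ triplet by random patch masking.

```python
import random


def mask_patches(image, patch_size, mask_ratio, rng, fill=0):
    """Return a copy of `image` with a random subset of square patches overwritten.

    `image` is a list of rows of pixel values. `patch_size`, `mask_ratio`,
    and `fill` are illustrative assumptions, not values from the paper.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    # Enumerate the patch grid and pick which patches to hide.
    rows, cols = height // patch_size, width // patch_size
    patch_ids = [(r, c) for r in range(rows) for c in range(cols)]
    num_masked = int(mask_ratio * len(patch_ids))
    masked_ids = rng.sample(patch_ids, num_masked)

    # Copy the image so the original stays intact as the retrieval target.
    masked = [list(row) for row in image]
    for r, c in masked_ids:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                masked[y][x] = fill
    return masked


def make_triplet(image, text, patch_size=2, mask_ratio=0.5, seed=0):
    """Build a (masked image, text, image) triplet from one image-text pair."""
    rng = random.Random(seed)
    masked_image = mask_patches(image, patch_size, mask_ratio, rng)
    # Query side: masked image + text. Target side: the original image.
    return masked_image, text, image
```

In this sketch the masked image plays the role of the reference image, the paired caption plays the role of the modification text, and the unmasked image is the target, so a contrastive objective can be applied to the composed query and the target without any manually labeled triplets.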
