Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

Junyang Chen, Hanjiang Lai

TL;DR

This work tackles zero-shot composed image retrieval (ZS-CIR) by addressing the gap between pre-trained vision-language models and CIR tasks. It introduces a simple, self-supervised masked tuning strategy that converts image-text pairs into CIR-like triplets $\langle I^m, T, I \rangle$ by masking image patches, so that the model is trained to learn text-guided modifications. The approach yields substantial improvements on FashionIQ, CIRR, CIRCO, and GeneCIS with both CLIP and BLIP backbones, and remains effective when the backbones are frozen or extended with lightweight combiners. Because it needs only image-text pairs, masked tuning reduces reliance on costly triplet annotations, which makes zero-shot CIR easier to scale to real-world retrieval tasks.

Abstract

Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has gained increasing attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, there is a substantial discrepancy between pre-trained vision-language models and CIR tasks: the vision-language models focus on learning similarities, whereas CIR aims to learn text-guided modifications of the image. In this paper, we introduce a novel masked tuning approach for pre-trained models that uses unlabeled data and reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, to reduce the gap, we reformulate the contrastive learning of the vision-language model as the CIR task: we randomly mask input image patches to generate a $\langle$masked image, text, image$\rangle$ triplet from each image-text pair. Then, we propose a simple but novel masked tuning method, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, the proposed masked tuning learns to better capture fine-grained text-guided modifications. Extensive experimental results demonstrate that our approach significantly outperforms the baseline models on four ZS-CIR datasets: FashionIQ, CIRR, CIRCO, and GeneCIS. Our code is available at https://github.com/Chen-Junyang-cn/PLI
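The triplet construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names `mask_patches` and `info_nce`, the patch size of 16, the masking ratio of 0.75, the temperature of 0.07, zero-filling of masked patches, and the element-wise sum used to compose the query are all assumptions made for illustration; random vectors stand in for encoder outputs.

```python
import numpy as np


def mask_patches(image, patch_size=16, mask_ratio=0.75, rng=None):
    """Zero out a random subset of non-overlapping patches (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    height, width, _ = image.shape
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    masked = image.copy()
    for idx in masked_idx:
        row, col = divmod(int(idx), grid_w)
        masked[row * patch_size:(row + 1) * patch_size,
               col * patch_size:(col + 1) * patch_size, :] = 0
    return masked, masked_idx


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(query, target, temperature=0.07):
    """Contrastive loss: row i of `query` should match row i of `target`."""
    logits = l2_normalize(query) @ l2_normalize(target).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for an image I
masked_image, _ = mask_patches(image, rng=rng)  # masked image I^m

# Stand-ins for encoder outputs on a batch of 8 triplets <I^m, T, I>.
masked_image_feat = rng.normal(size=(8, 32))
text_feat = rng.normal(size=(8, 32))
image_feat = rng.normal(size=(8, 32))

# Assumption: compose the query by summing the two feature vectors.
query_feat = masked_image_feat + text_feat
loss = info_nce(query_feat, image_feat)
```

In this sketch the masked image plays the role of the reference image and the original image plays the role of the target, so the training signal has the same shape as a CIR query at inference time.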

Paper Structure (23 sections, 1 equation, 5 figures, 7 tables)

This paper contains 23 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) Workflow of the composed image retrieval task, which differs fundamentally from both the pre-trained VLM objective and textual inversion. (b) Pre-trained vision-language models [li2022blip, radford2021clip] are trained to align text and image features. (c) Recent ZS-CIR methods [Baldrati_2023_ICCV, saito2023pic2word] introduce textual inversion into the pre-trained VLM, mapping the reference image into the text domain to further improve performance.
  • Figure 2: Overview of our masked pre-training method. Left: we randomly mask image patches with a high masking ratio, so that the pre-training task approximates the CIR task. Right: we apply the pre-trained model to ZS-CIR at inference time.
  • Figure 3: Quantitative results with the BLIP ViT-B/16 backbone. We compare the proposed method with LinCIR [gu2023languageonly] on four benchmarks. Performance is reported as average R@10 for FashionIQ and CIRR, mAP@5 for CIRCO, and average R@3 for GeneCIS.
  • Figure 4: Influence of the masking ratio $w$ on the FashionIQ dataset with different backbones: (a) CLIP B/32, (b) BLIP B/16.
  • Figure 5: Top-3 examples retrieved from the CIRCO validation set. Ground-truth retrievals are highlighted with a red outline. We compare the top-3 results of the proposed method with those of the previous SOTA model SEARLE [Baldrati_2023_ICCV].
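The R@K metric named in the Figure 3 caption can be sketched as follows for the case of one ground-truth target per query. This is a generic illustration, not the benchmarks' official evaluation code: the function name `recall_at_k` and the use of cosine similarity over pre-computed features are assumptions, and mAP@5 on CIRCO (which allows several ground truths per query) is not covered.

```python
import numpy as np


def recall_at_k(query_feats, gallery_feats, target_idx, k):
    """Fraction of queries whose target is among the k most similar items."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T                               # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k best matches
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())


# Toy gallery of 5 orthogonal items; the third query points at the wrong item.
gallery = np.eye(5)
queries = gallery[[0, 2, 4]]
targets = [0, 2, 3]
r_at_1 = recall_at_k(queries, gallery, targets, k=1)
r_at_5 = recall_at_k(queries, gallery, targets, k=5)
```

In the toy example two of the three queries rank their target first, and every target falls within the top 5 because the gallery has only five items.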