Data-Efficient Generalization for Zero-shot Composed Image Retrieval

Zining Chen, Zhicheng Zhao, Fei Su, Xiaoqin Zhang, Shijian Lu

TL;DR

This work tackles zero-shot composed image retrieval (ZS-CIR) by addressing two core challenges: the modality discrepancy between training-time image-text alignment and inference-time composition, and the distribution shift that leads to overfitting. It introduces Data-efficient Generalization (DeG), comprising a Textual Supplement (TS) module that enriches pseudo-word semantics by training with complementary textual tokens, and a Semantic Set (S-Set) that exploits the zero-shot capabilities of pretrained vision-language models to improve generalization. At inference, DeG blends TS-derived semantics with the standard pseudo-word to form robust prompts, without relying on external models during retrieval. Across four ZS-CIR benchmarks, DeG achieves state-of-the-art results with only a fraction of the training data and at reduced computational cost.

Abstract

Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization due to modality discrepancy and distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set). The TS module exploits compositional textual semantics during training, enhancing the pseudo-word token with more linguistic semantics and thus mitigating the modality discrepancy effectively. The S-Set exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating the overfitting issue from the redundancy of the large-scale image-text data. Extensive experiments over four ZS-CIR benchmarks show that DeG outperforms the state-of-the-art (SOTA) methods with much less training data, and saves substantial training and inference time for practical usage.
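
The pseudo-word idea described in the abstract can be illustrated with a toy example. The sketch below is not DeG's implementation and includes neither the TS module nor the S-Set. It only shows the general pattern of mapping an image embedding to a pseudo-word token, placing that token among text token embeddings, and retrieving by cosine similarity. All names (`linear_map`, `encode_text`, `compose_query`, `retrieve`) are hypothetical, the mean-pooling "text encoder" is a stand-in for a real pretrained encoder such as CLIP's, and the weight matrix is fixed by hand where a real system would learn it.

```python
import math


def linear_map(image_emb, weight):
    """Hypothetical mapping network: a single linear layer that projects
    an image embedding into the token-embedding space."""
    return [sum(w * x for w, x in zip(row, image_emb)) for row in weight]


def encode_text(token_embs):
    """Toy stand-in for a text encoder (assumption): mean-pool the token
    embeddings and L2-normalize the result."""
    dim = len(token_embs[0])
    pooled = [sum(tok[i] for tok in token_embs) / len(token_embs) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (norm_a * norm_b)


def compose_query(image_emb, weight, prefix_tokens, suffix_tokens):
    """Build the composed query: prompt prefix, pseudo-word token, then the
    tokens of the modification text, all passed through the text encoder."""
    pseudo_word = linear_map(image_emb, weight)
    return encode_text(prefix_tokens + [pseudo_word] + suffix_tokens)


def retrieve(query, gallery):
    """Return gallery indices sorted by descending cosine similarity."""
    return sorted(range(len(gallery)), key=lambda i: cosine(query, gallery[i]), reverse=True)


# Tiny 2-D example with an identity mapping (values chosen by hand).
identity = [[1.0, 0.0], [0.0, 1.0]]
reference_image = [1.0, 0.0]       # embedding of the reference image
modification = [[0.0, 1.0]]        # token embeddings of the modification text
gallery = [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

query = compose_query(reference_image, identity, [], modification)
ranking = retrieve(query, gallery)
print(ranking)
```

In this toy setup the composed query lies between the reference image direction and the modification text direction, so the gallery item that combines both ranks first. In the paradigm the abstract describes, only the mapping network is added on top of the pretrained model, which is why the quality of the pseudo-word token matters so much for generalization.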

Paper Structure

This paper contains 16 sections, 19 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison of training and inference time across paradigms. Previous methods from the vision-language pretraining paradigm require no external models but incur high training costs because of the large-scale image-text dataset, while methods from the triplet generation paradigm incur either heavy training overhead or high inference latency because large external models perform generation during training or inference.
  • Figure 2: Comparison between data from the training and inference sets. Regarding the modality discrepancy, the visual and textual information is aligned during training but compositional during inference. Meanwhile, the distribution shift in both images and text between the training and inference datasets is substantial.
  • Figure 3: The overall framework of our method DeG. The left part presents the proposed modules, including the Textual Supplement (TS) module and the Semantic Set (S-Set) with novel training objectives. The right part illustrates how the S-Set is selected under two conditions: the CLIP-predicted caption for the image is incorrect, yet it is similar to the ground-truth caption.
  • Figure 4: The inference process of our method DeG.
  • Figure 5: Effect of hyper-parameters on average R@10 for the Fashion-IQ validation set and mAP@5 for the CIRCO dataset.
  • ...and 1 more figures