Zero-shot Composed Text-Image Retrieval

Yikun Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR

This work tackles composed image retrieval (CIR) by fusing visual and textual cues to retrieve target images in a zero-shot setting. It introduces a scalable dataset construction pipeline that leverages large image-caption corpora and a transformer-based adaptive fusion model, TransAgg, to build and fuse multimodal representations. Trained entirely on automatically constructed data, the model achieves state-of-the-art or competitive zero-shot results on CIRR and FashionIQ, demonstrating substantial data-efficiency versus fully supervised methods. The approach offers a practical path to scalable CIR systems with minimal manual labeling and broad applicability to multimodal retrieval tasks.

Abstract

In this paper, we consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expressive ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR models, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure and the effectiveness of the core components in TransAgg; (iv) when evaluated on publicly available benchmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets and then directly conducting inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/
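
To make the idea of adaptively combining modality features concrete, the following is a minimal sketch and not the TransAgg implementation. It assumes each modality has already been encoded into a fixed-length vector, fuses the vectors with softmax-normalized weights, and ranks a gallery by cosine similarity. The function names (`fuse`, `rank_gallery`) and the hand-set gate scores are hypothetical; in a trained model the scores would come from learned layers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, gate_scores):
    """Weighted sum of feature vectors, then L2 normalization.

    `features` is a list of equal-length vectors (e.g. one per modality).
    `gate_scores` holds one raw score per vector. Hand-set here (an
    assumption); a trained model would predict them.
    """
    weights = softmax(gate_scores)
    dim = len(features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return l2_normalize(fused)


def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity."""
    q = l2_normalize(query)
    sims = [
        sum(a * b for a, b in zip(q, l2_normalize(item)))
        for item in gallery
    ]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)


# Toy example: three orthogonal "modality" vectors with equal gate scores.
query = fuse(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [0.0, 0.0, 0.0],
)
gallery = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, -1.0]]
print(rank_gallery(query, gallery))
```

Because the weights come from a softmax, the fused query is always a convex combination of its inputs, so raising one gate score shifts the query smoothly toward that modality.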

Paper Structure (24 sections, 5 equations, 5 figures, 14 tables)

This paper contains 24 sections, 5 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: An overview of our proposed architecture, which consists of a visual encoder, a text encoder, a Transformer module and an adaptive aggregation module.
  • Figure 2: An overview of our proposed dataset construction procedure, based on sentence templates (left) or large language models (right).
  • Figure 3: Failure cases of dataset construction. The edited caption and target image caption in the first row have a high similarity score, but their semantic meanings are significantly different. In the second row, we intend to retrieve a red watering can, but a metal watering can is mistakenly retrieved instead. In the third row, the numerical values in both the reference image caption and the target image caption are incorrect.
  • Figure 4: Qualitative results on CIRR. From left to right are the reference image, relative caption and the top five retrieved images. The ground truth is marked with a red box.
  • Figure 5: Explainability heatmaps for the CIR task. From left to right are the heatmap, reference image, relative caption and the target image. The heatmap is computed from the attention between the bolded token in the relative caption and the image patches.