From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang

TL;DR

This work addresses zero-shot CIR by moving away from single-token projections toward a two-stage learning paradigm. Stage I focuses on mapping images to a rich pseudo-word token via a Visual Semantic Injection module and soft text alignment, establishing a strong image-to-word representation. Stage II introduces lightweight composing adapters and uses a small amount of synthetic data to train the model to fuse the pseudo-word with modification text, aided by hard negative mining. The method achieves state-of-the-art performance on FashionIQ, CIRR, and CIRCO, with notable gains even when synthetic data is scarce, demonstrating improved generalization and data efficiency for CIR tasks.

Abstract

Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient representation capacity of the pseudo-word token, (2) discrepancies between the training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework in which training progresses from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer, finer-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments on three public datasets show that our approach outperforms existing approaches.
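
The projection-based idea described in the abstract (map an image to a pseudo-word token, then combine it with modification text) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear `phi_weight`, the two-dimensional `vocab`, the mean-pooling stand-in for the text encoder, and the gallery entries are hypothetical and are not the authors' modules or data.

```python
import math

def linear_map(image_feat, weight):
    """Toy mapping network (phi): project an image feature into word-embedding space."""
    return [sum(w * x for w, x in zip(row, image_feat)) for row in weight]

def build_composed_query(prompt_tokens, vocab, pseudo_token, placeholder="$"):
    """Swap the placeholder for the pseudo-word token; look up real words in vocab."""
    return [pseudo_token if tok == placeholder else vocab[tok] for tok in prompt_tokens]

def mean_pool(token_embeddings):
    """Stand-in text encoder (assumption): average the token embeddings."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, gallery):
    """Return the name of the gallery item most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Hypothetical toy data: axis 0 ~ "dress-ness", axis 1 ~ "red-ness".
phi_weight = [[1.0, 0.0], [0.0, 1.0]]   # identity mapping, for readability only
vocab = {"red": [0.0, 1.0]}
reference_image_feat = [1.0, 0.0]       # reference image: a dress, colour unspecified

pseudo_token = linear_map(reference_image_feat, phi_weight)
query_tokens = build_composed_query(["$", "red"], vocab, pseudo_token)
query = mean_pool(query_tokens)

gallery = {
    "blue_dress": [1.0, -1.0],
    "red_dress": [1.0, 1.0],
    "red_car": [-1.0, 1.0],
}
best_match = retrieve(query, gallery)
```

In this toy setting the composed query lands between the reference image content and the modification word, so the gallery item sharing both attributes scores highest. A real system would replace `mean_pool` with a trained text encoder, which is where the training and inference discrepancy noted in the abstract arises.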

Paper Structure

This paper contains 12 sections, 14 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Training and inference pipelines of the projection-based method. During training, the model focuses on mapping images to pseudo-word tokens, while during inference, the model needs to combine the pseudo-word token with real words to generate the composed query. "$\phi$" refers to the mapping network, "$" indicates the pseudo-word token, and "$T_m$" denotes the modified text.
  • Figure 2: Paradigms of current projection-based methods: (a) The baseline approach, which primarily focuses on mapping images to pseudo-word tokens. (b) Methods that leverage LLMs or diffusion models to generate large-scale synthetic data and train mapping and composing simultaneously. (c) Our proposed two-stage framework, which decouples the learning process into mapping and composing stages and enhances the model’s compositional understanding capability using only a small amount of synthetic data. "TL" represents the designed token learner module; "VSI" and "CA" denote the visual semantic injection module and the composing adapter, respectively.
  • Figure 3: The framework of our proposed method comprises two stages: mapping learning (left) and composing learning (right). In the first stage, to comprehensively capture visual information, we introduce the visual semantic injection module (VSI), which can be integrated into various layers of the text encoder to continuously inject visual semantics. Additionally, a soft text alignment loss is applied to ensure the pseudo-word token aligns well with real words. In the second stage, we incorporate several composing adapters (CA) and adopt a hard negative mining strategy to optimize the text encoder, enabling it to effectively encode the composed query by combining the pseudo-word token with the modification text.
  • Figure 4: Retrieved results of "A photo of $" in Stage I on Fashion-IQ (left) and CIRR (right). "$" indicates the pseudo-word token generated from the query image.
  • Figure 5: Illustration of composed image retrieval on Fashion-IQ and CIRR.
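
The Figure 3 caption mentions a hard negative mining strategy in the second stage. The sketch below shows one common way such a strategy can be paired with a contrastive (InfoNCE-style) loss. It is an illustration under assumptions, not the authors' formulation: the top-k selection rule, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math

def dot(a, b):
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(query, negatives, k):
    """Assumed mining rule: keep the k negatives most similar to the query."""
    return sorted(negatives, key=lambda neg: dot(query, neg), reverse=True)[:k]

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]

# Hypothetical toy embeddings for a composed query, its target, and candidates.
composed_query = [1.0, 0.0]
target_image = [0.9, 0.1]
candidate_negatives = [[0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]]

hard_negatives = mine_hard_negatives(composed_query, candidate_negatives, k=1)
hard_loss = info_nce(composed_query, target_image, hard_negatives)
easy_loss = info_nce(composed_query, target_image, [[-1.0, 0.0]])
```

With these toy values the mined negative is the candidate closest to the query, and it yields a larger loss than an easy negative does, which is the reason hard negatives tend to provide a stronger training signal.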