Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval

Haokun Wen, Xuemeng Song, Xiaolin Chen, Yinwei Wei, Liqiang Nie, Tat-Seng Chua

TL;DR

This work tackles composed image retrieval (CIR) by addressing the limitation of nonlinear feature-level fusion in vision-language pre-trained (VLP) models. It introduces DQU-CIR, a dual query unification framework that converts a multimodal query into unified textual and visual queries using training-free components (text-oriented via captioning and vision-oriented via keyword insertion), followed by a linear adaptive fusion to stay within the VLP embedding space. The approach achieves state-of-the-art results on fashion-domain datasets (FashionIQ, Shoes, Fashion200K) and competitive performance on CIRR, with ablations confirming the complementary strengths of the two unification strategies and the effectiveness of the fusion mechanism. Overall, the paper demonstrates that raw-data level fusion leverages VLP cross-modal encoding and OCR capabilities to improve CIR performance while maintaining a lightweight, training-free component setup.

Abstract

Composed image retrieval (CIR) aims to retrieve the target image based on a multimodal query, i.e., a reference image paired with corresponding modification text. Recent CIR studies leverage vision-language pre-trained (VLP) methods as the feature extraction backbone, and perform nonlinear feature-level multimodal query fusion to retrieve the target image. Despite the promising performance, we argue that their nonlinear feature-level multimodal fusion may lead to the fused feature deviating from the original embedding space, potentially hurting the retrieval performance. To address this issue, in this work, we propose shifting the multimodal fusion from the feature level to the raw-data level to fully exploit the VLP model's multimodal encoding and cross-modal alignment abilities. In particular, we introduce a Dual Query Unification-based Composed Image Retrieval framework (DQU-CIR), whose backbone simply involves a VLP model's image encoder and a text encoder. Specifically, DQU-CIR first employs two training-free query unification components: text-oriented query unification and vision-oriented query unification, to derive a unified textual and visual query based on the raw data of the multimodal query, respectively. The unified textual query is derived by concatenating the modification text with the extracted reference image's textual description, while the unified visual query is created by writing the key modification words onto the reference image. Ultimately, to address diverse search intentions, DQU-CIR linearly combines the features of the two unified queries encoded by the VLP model to retrieve the target image. Extensive experiments on four real-world datasets validate the effectiveness of our proposed method.
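The two steps the abstract describes most concretely, concatenating the reference image's caption with the modification text and linearly combining the two unified-query features, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the `sep` separator, the fixed scalar `weight`, and the toy 2-D feature vectors are all assumptions made for illustration; in the actual method the features come from a VLP model's text and image encoders, and the abstract does not specify how the combination weights are obtained.

```python
import math


def unify_text_query(caption, modification, sep=", "):
    """Text-oriented unification: join caption and modification text.

    `sep` is a hypothetical separator; the abstract only says the two
    strings are concatenated.
    """
    return f"{caption.strip()}{sep}{modification.strip()}"


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_linear(text_feat, visual_feat, weight):
    """Convex combination of the two unified-query features.

    `weight` is an assumed fixed scalar in [0, 1]; a linear mix keeps the
    fused query in the span of the encoder outputs.
    """
    if len(text_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fused = [weight * t + (1.0 - weight) * v
             for t, v in zip(text_feat, visual_feat)]
    return l2_normalize(fused)


def rank_candidates(query_feat, candidate_feats):
    """Return candidate indices sorted by cosine similarity, best first."""
    query = l2_normalize(query_feat)
    scores = [
        sum(q * c for q, c in zip(query, l2_normalize(cand)))
        for cand in candidate_feats
    ]
    return sorted(range(len(candidate_feats)),
                  key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for encoder outputs (assumed values, not real features).
    text_feat = [1.0, 0.0]    # feature of the unified textual query
    visual_feat = [0.0, 1.0]  # feature of the unified visual query
    candidates = [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

    print(unify_text_query("a red dress", "is shorter"))
    fused = fuse_linear(text_feat, visual_feat, weight=0.5)
    print(rank_candidates(fused, candidates))
```

With equal weighting, the candidate that matches both unified queries ranks first. Moving `weight` toward 0 or 1 shifts the ranking toward the visual or the textual query, which is one way to read the abstract's remark about addressing diverse search intentions.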

Paper Structure (18 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison between existing methods and ours.
  • Figure 2: Performance comparison of our method with state-of-the-art baseline methods on two public datasets.
  • Figure 3: The proposed DQU-CIR consists of three components: (a) text-oriented query unification, (b) vision-oriented query unification, and (c) linear adaptive fusion-based target retrieval.
  • Figure 4: Illustration of our designed prompt and an example for the key words extraction.
  • Figure 5: Performance on FashionIQ-Avg (VAL-Split) and Shoes of DQU-CIR with different versions of CLIP. The horizontal dashed lines denote the best baseline performance. The random seeds are fixed at $42$ across the experiments.
  • ...and 1 more figure