Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval

Yucheng Suo; Fan Ma; Linchao Zhu; Yi Yang

TL;DR

This work tackles zero-shot composed image retrieval (ZS-CIR) with KEDs, a Knowledge-Enhanced Dual-stream framework. KEDs combines a Bi-modality Knowledge-guided Projection (BKP), which draws on an external image-caption database to enrich pseudo-word tokens with fine-grained attributes, with a second stream that aligns pseudo-word tokens with textual concepts using pseudo-triplets mined from image-caption pairs. At inference, the model fuses the two streams into a single composed feature for retrieval, so it generalizes across datasets without triplet annotations. Experiments on ImageNet-R, COCO, Fashion-IQ, and CIRR show state-of-the-art zero-shot CIR performance, highlighting improved attribute understanding and cross-domain adaptability. The approach is aimed at flexible, fine-grained image retrieval in diverse domains, and could be extended with textual descriptions generated by language models.

Abstract

We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which aims to retrieve a target image given a reference image and a text description, without training on triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features into the text embedding space. However, they focus on the global visual representation and ignore detailed attributes, e.g., color, object count, and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference image by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives. Moreover, KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space. Extensive experiments on widely used benchmarks, i.e., ImageNet-R, COCO object, Fashion-IQ, and CIRR, show that KEDs outperforms previous zero-shot composed image retrieval methods.
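The flow described in the abstract (retrieve related database entries, enrich the pseudo-word token, then combine two streams) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the plain averaging used in place of the learned projection, the random placeholder for the second stream's token, and the weighted-sum fusion are hypothetical and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_neighbors(query, keys, k):
    """Indices of the k database entries with the highest dot-product score."""
    scores = [dot(query, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def knowledge_enhanced_token(image_feat, db_image_feats, db_caption_feats, k=2):
    """Hypothetical stand-in for a knowledge-guided projection.

    Assumption: the retrieved neighbors' image and caption features are simply
    summed with the query feature. A learned module would replace this step.
    """
    pooled = list(image_feat)
    for i in top_k_neighbors(image_feat, db_image_feats, k):
        for j in range(len(pooled)):
            pooled[j] += db_image_feats[i][j] + db_caption_feats[i][j]
    return l2_normalize(pooled)


def fuse_streams(token_a, token_b, weight=0.5):
    """Assumed fusion rule: weighted sum of two tokens, then re-normalize."""
    mixed = [weight * a + (1.0 - weight) * b for a, b in zip(token_a, token_b)]
    return l2_normalize(mixed)


# Toy data: random unit vectors stand in for frozen-encoder features.
random.seed(0)
dim = 8
db_images = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(5)]
db_captions = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(5)]
reference = db_images[3]  # reuse a database entry so its top neighbor is known

neighbors = top_k_neighbors(reference, db_images, k=2)
token_stream_1 = knowledge_enhanced_token(reference, db_images, db_captions, k=2)
# Placeholder for the token produced by the second (textual alignment) stream.
token_stream_2 = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
fused = fuse_streams(token_stream_1, token_stream_2, weight=0.5)
```

In a real system the fused token would be inserted into a text prompt alongside the modification description and encoded by the text encoder; that step is omitted here because it depends on details this page does not give.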

Paper Structure (17 sections, 5 equations, 7 figures, 5 tables)

Introduction
Related Work
Composed Image Retrieval
Knowledge Enhanced Methods
Vision-language Pretraining
Method
Preliminaries
Bi-modality Knowledge-guided Projection
Dual-stream Semantic Alignment
Hybrid Inference
Experiments
Datasets and Setup
Implementation Details
Quantitative and Qualitative Results
Ablation Studies
...and 2 more sections

Figures (7)

Figure 1: Comparison between existing methods and KEDs. Pic2word [saito2023pic2word] learns the mapping network $\phi_M$ using image-only contrastive learning and generates the pseudo-word token $v$. We propose to augment the pseudo-word token with external knowledge. In addition, we introduce an extra branch $\phi_A$, which shares its architecture with $\phi_M$, for textual concept alignment. Note that $f_v$ and $f_t$ denote the frozen CLIP visual encoder and text encoder, respectively.
Figure 2: Overall framework of KEDs. The left part of the figure shows the dual-stream training of KEDs, consisting of the image-only contrastive training branch (orange) and the textual concept alignment branch (blue). The right part shows the architecture of the proposed Bi-modality Knowledge-guided Projection.
Figure 3: A simple illustration of the inference process of KEDs.
Figure 4: Qualitative results on the Fashion-IQ dataset. Images with green borders represent the ground truth.
Figure 5: Qualitative results on the CIRR dataset. Images with green borders represent the ground truth.
...and 2 more figures
