Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

Xin Wang, Haipeng Zhang, Mang Li, Zhaohui Xia, Yueguo Chen, Yu Zhang, Chunyu Wei

Abstract

Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) is the vision-language modality gap, which existing text-centric and diffusion-based approaches struggle to bridge. To address this, we propose Fusion-Diff, a generative editing framework for multimodal alignment that is both effective and data-efficient. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, it incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on a synthetic dataset of only 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further improve the interpretability of the model by visualizing the fused multimodal representations.

Paper Structure

This paper contains 54 sections, 19 equations, 8 figures, 4 tables, 4 algorithms.

Figures (8)

  • Figure 1: Comparison of different paradigms for zero-shot composed image retrieval. (a) Text-centric methods employ textual inversion or LLM-based description generation to perform retrieval in text space, discarding visual information. (b) Visual-assisted methods synthesize pseudo visual features via diffusion in visual space but still rely on text-centric retrieval mechanisms. (c) Fusion-Diff (Ours) operates directly in the joint vision-language space, modeling the distribution of target-fused embeddings to enable multimodal-to-multimodal retrieval, fundamentally addressing the modality gap.
  • Figure 2: The Framework of Fusion-Diff.
  • Figure 3: Visualization results on CIRR test set.
  • Figure 4: Visualization results on FashionIQ validation set.
  • Figure 5: Parameter sensitivity analysis on CIRR test set and FashionIQ validation set.
  • ...and 3 more figures
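
The caption of Figure 1(c) describes retrieval as matching a generated target-fused embedding against candidate embeddings in the joint vision-language space. The final ranking step of such a pipeline might look roughly like the sketch below. It is an illustration only: the use of cosine similarity, the function names `cosine` and `rank_candidates`, and the toy three-dimensional vectors are all assumptions, and the page does not specify how Fusion-Diff produces or compares its embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, gallery, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first.

    query_embedding: stand-in for a generated target-fused embedding.
    gallery: dict mapping a candidate image name to its fused embedding.
    Both are hypothetical placeholders, not outputs of the actual model.
    """
    scored = [(name, cosine(query_embedding, emb)) for name, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data standing in for fused embeddings of three candidate images.
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.6, 0.8, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}
query = [0.5, 0.5, 0.0]

ranking = rank_candidates(query, gallery, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

In this toy gallery, `img_b` ranks first because its direction is closest to the query. In a real system the same ranking would typically run as a batched matrix product over normalized embeddings.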