CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

TL;DR

This work tackles zero-shot composed image retrieval (ZS-CIR) by reframing CIR as conditional latent-space editing with diffusion. It introduces CompoDiff, a Transformer-based denoiser that operates in the CLIP latent space and is trained in two stages to support diverse conditions (text, negative text, masks), with inference controlled through classifier-free guidance. A synthetic dataset, SynthTriplets18M (18.8M triplets), is built automatically through keyword-based captioning, LLM augmentation, and IP2P-inspired triplet expansion, enabling strong ZS-CIR performance on FashionIQ, CIRR, CIRCO, and GeneCIS. The results show state-of-the-art zero-shot CIR, together with control over the relative strength of the text and image conditions and over the trade-off between inference speed and accuracy. The work also provides data-generation and filtering pipelines that can be reused for future diffusion-based retrieval.

Abstract

This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million triplets of reference images, conditions, and corresponding target images to train CIR models. CompoDiff and SynthTriplets18M tackle the shortcomings of previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks (FashionIQ, CIRR, CIRCO, and GeneCIS), but also enables more versatile and controllable CIR by accepting various conditions, such as negative text and image mask conditions. CompoDiff also offers control over the condition strength between text and image queries and over the trade-off between inference speed and performance, neither of which is available with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff

Paper Structure

This paper contains 34 sections, 4 equations, 17 figures, and 13 tables.

Figures (17)

  • Figure 1: Composed Image Retrieval (CIR) scenarios. (a) A standard CIR scenario. (b-d) Our versatile CIR scenarios with mixed conditions (e.g., negative text and mask). Results by CompoDiff on LAION-2B.
  • Figure 2: Comparisons of CIR methods. (a) Fusion-based methods (e.g., ARTEMIS [delmas2022artemis] and Combiner [baldrati2022clip4cir]) make a fused feature from the image feature $z_{i_R}$ and the text feature $z_c$. (b) Inversion-based methods (e.g., Pic2Word [saito2023pic2word], SEARLE [circo], and LinCIR [lincir]) project $z_{i_R}$ into the text space, then perform text-to-image retrieval. (c) We apply a diffusion process to $z_{i_R}$ with classifier-free guidance by additional conditions. (b) and (c) use frozen encoders, and (a) usually tunes the encoders.
  • Figure 3: Training overview. Stage 1 is trained on LAION-2B with Eq. \ref{eq:stage1-t2i}. For Stage 2, we alternately update the Denoising Transformer $\epsilon_\theta$ on LAION-2B with Eq. \ref{eq:stage1-t2i} and Eq. \ref{eq:stage2-inpaint}, and on SynthTriplets18M with Eq. \ref{eq:stage2-triplet}.
  • Figure 4: Inference overview. Using the Denoising Transformer $\epsilon_\theta$ trained in Stages 1 and 2 (Figure 3), we perform composed image retrieval (CIR). We use classifier-free guidance (Eq. \ref{eq:cfg}) to transform the input reference image feature into the target image feature, and perform image-to-image retrieval on the retrieval DB.
  • Figure 5: Inference condition control by varying $w_I$ and $w_T$ in Eq. \ref{eq:cfg}.
  • ...and 12 more figures
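
The captions of Figures 4 and 5 describe inference as classifier-free guidance with two weights, $w_I$ and $w_T$, followed by image-to-image retrieval. The sketch below illustrates that idea in NumPy under stated assumptions. The exact form of the guidance equation is not reproduced in this summary, so the code assumes the two-condition combination popularized by InstructPix2Pix, which may differ from the paper's. The function names `guided_prediction` and `retrieve_top_k` are hypothetical, and random vectors stand in for the denoiser outputs and CLIP features. It shows one guidance combination only, not a full diffusion sampler.

```python
import numpy as np


def guided_prediction(eps_uncond, eps_img, eps_full, w_i, w_t):
    """Combine three denoiser outputs with two guidance weights.

    ASSUMED form (InstructPix2Pix-style), not taken from the paper:
      eps_uncond: prediction with no conditions
      eps_img:    prediction with the reference image condition only
      eps_full:   prediction with both image and text conditions
    """
    return (
        eps_uncond
        + w_i * (eps_img - eps_uncond)   # push toward the reference image
        + w_t * (eps_full - eps_img)     # push toward the text condition
    )


def retrieve_top_k(query, database, k=3):
    """Return indices of the k database rows most similar to the query
    under cosine similarity (image-to-image retrieval on features)."""
    q = query / np.linalg.norm(query)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8  # toy feature size; real CLIP features are much larger

    # Random stand-ins for the three denoiser outputs at one step.
    eps_uncond, eps_img, eps_full = rng.normal(size=(3, dim))

    # Sweep the text weight while keeping the image weight fixed.
    for w_t in (1.0, 3.0, 7.5):
        feature = guided_prediction(eps_uncond, eps_img, eps_full, 1.5, w_t)
        database = rng.normal(size=(100, dim))  # toy retrieval DB
        print(w_t, retrieve_top_k(feature, database, k=3))
```

In this assumed form, setting both weights to 1 returns the fully conditioned prediction and setting both to 0 returns the unconditional one, so raising $w_T$ relative to $w_I$ weights the text condition more heavily than the reference image.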