good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval

Pranavi Kolouju, Eric Xing, Robert Pless, Nathan Jacobs, Abby Stylianou

TL;DR

This work addresses the challenge of producing high-quality, scalable annotations for Composed Image Retrieval (CIR) by introducing good4cir, a pipeline driven by vision-language models that generates rich synthetic triplets through a multi-stage prompting process. The method decomposes the task into object-level description extraction, target-image descriptor alignment, and difference narration, followed by caption permutations to create diverse, human-like modifications. The authors build two new datasets with the pipeline: CIRR$_R$, which rewrites the annotations of an existing dataset, and Hotel-CIR, which composes new domain-specific CIR data. Training on these synthetic datasets improves CIR performance, especially for fine-grained, object-centric queries, with cross-dataset gains observed on CIRCO. The work provides a scalable framework and publicly available tooling to advance CIR and multi-modal retrieval research, enabling broader experimentation and domain adaptation.
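
The caption permutation step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function name `permute_captions`, the `max_parts` limit, the " and " joiner, and the example phrases are all assumptions made for illustration. It only shows the general idea of combining per-object difference phrases into several composite captions.

```python
from itertools import permutations


def permute_captions(differences, max_parts=2):
    """Join ordered subsets of per-object difference phrases into captions.

    Hypothetical helper: the real pipeline may select and phrase
    combinations differently.
    """
    captions = []
    # Consider subsets of size 1 up to max_parts (or fewer if the list is short).
    for size in range(1, min(max_parts, len(differences)) + 1):
        # permutations() yields every ordered selection of `size` phrases.
        for combo in permutations(differences, size):
            captions.append(" and ".join(combo))
    return captions


# Invented example phrases, not taken from the datasets.
example_differences = [
    "replace the blue curtains with white blinds",
    "add a floor lamp beside the bed",
    "remove the patterned rug",
]
captions = permute_captions(example_differences, max_parts=2)
```

With three phrases and `max_parts=2`, the sketch yields the three single-phrase captions plus every ordered pair, so one image pair can contribute several distinct modification texts.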

Abstract

Composed image retrieval (CIR) enables users to search images using a reference image combined with textual modifications. Recent advances in vision-language models have improved CIR, but dataset limitations remain a barrier. Existing datasets often rely on simplistic, ambiguous, or insufficient manual annotations, hindering fine-grained retrieval. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Our method involves: (1) extracting fine-grained object descriptions from query images, (2) generating comparable descriptions for target images, and (3) synthesizing textual instructions capturing meaningful transformations between images. This reduces hallucination, enhances modification diversity, and ensures object-level consistency. Applying our method improves existing datasets and enables creating new datasets across diverse domains. Results demonstrate improved retrieval accuracy for CIR models trained on our pipeline-generated datasets. We release our dataset construction framework to support further research in CIR and multi-modal retrieval.
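
The three steps listed in the abstract can be sketched as a chain of prompts. This is a minimal sketch under stated assumptions, not the released framework: the `VLM` callable interface, the prompt wording, all function names, and the choice to run the third stage on text only are hypothetical. The stand-in `fake_vlm` returns canned text so the flow can be run without a real model.

```python
from typing import Callable, Dict, List

# Hypothetical VLM interface: (prompt, image identifiers) -> generated text.
VLM = Callable[[str, List[str]], str]


def _lines(text: str) -> List[str]:
    """Split model output into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def describe_query(vlm: VLM, query_image: str) -> List[str]:
    # Stage 1: object-level descriptions of the query image.
    prompt = "List the objects in this image, one per line, with a short description."
    return _lines(vlm(prompt, [query_image]))


def describe_target(vlm: VLM, target_image: str, query_desc: List[str]) -> List[str]:
    # Stage 2: comparable descriptions of the target image, guided by stage 1.
    prompt = (
        "Here are object descriptions from another image:\n"
        + "\n".join(query_desc)
        + "\nDescribe the matching objects in this image, plus any new ones."
    )
    return _lines(vlm(prompt, [target_image]))


def narrate_differences(vlm: VLM, query_desc: List[str], target_desc: List[str]) -> List[str]:
    # Stage 3 (assumed text-only here): turn the two lists into modification texts.
    prompt = (
        "Query image objects:\n" + "\n".join(query_desc)
        + "\nTarget image objects:\n" + "\n".join(target_desc)
        + "\nWrite one instruction per line that turns the query into the target."
    )
    return _lines(vlm(prompt, []))


def generate_triplet(vlm: VLM, query_image: str, target_image: str) -> Dict[str, object]:
    query_desc = describe_query(vlm, query_image)
    target_desc = describe_target(vlm, target_image, query_desc)
    captions = narrate_differences(vlm, query_desc, target_desc)
    return {"query": query_image, "target": target_image, "captions": captions}


def fake_vlm(prompt: str, images: List[str]) -> str:
    """Canned stand-in for a real model, used only to exercise the flow."""
    if prompt.startswith("List the objects"):
        return "bed: white duvet\nlamp: brass table lamp"
    if prompt.startswith("Here are object descriptions"):
        return "bed: grey duvet\nlamp: brass table lamp"
    return "change the duvet from white to grey"


triplet = generate_triplet(fake_vlm, "query.jpg", "target.jpg")
```

Keeping each stage as its own small prompt means no single call has to hold both images and the full comparison at once, which is the property the abstract credits with reducing hallucination.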

Paper Structure

This paper contains 24 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Existing composed image retrieval datasets are costly to construct and often have low-quality text annotations. We propose a new approach that leverages VLMs to generate higher-quality synthetic text annotations for composed image retrieval.
  • Figure 2: Qualitative issues with existing CIR datasets.
  • Figure 3: Our synthetic CIR data generation pipeline. The three-stage pipeline compares a query image and a target image through a structured flow of data, which avoids overwhelming the context window of the VLM and so mitigates hallucination. In this figure, the prompts are simplified. The full prompts are discussed in the text.
  • Figure 4: Comparison of the direct single-stage prompting method for capturing differences with good4cir's three-stage approach.
  • Figure 5: Example text differences generated for the CIRR$_R$ (top) and Hotel-CIR (bottom) datasets using our synthetic data generation pipeline. For CIRR$_R$, we include the original caption as well.
  • ...and 2 more figures