Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval

Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su

TL;DR

This work tackles the data bottleneck in Composed Image Retrieval (CIR) by proposing CIRHS, a large-scale synthetic triplet dataset created through an LLM-guided prompt pipeline and high-quality image synthesis, followed by MLLM-based filtering. It also introduces CoAlign, a Hybrid Contextual Alignment framework that optimizes CIR representations via global contextual alignment and local contextual reasoning, suitable for both zero-shot and supervised settings. Empirical results show state-of-the-art zero-shot performance across FashionIQ, CIRR, and CIRCO when trained on CIRHS, and superior supervised performance against prior CIR methods, validating the effectiveness of fully synthetic training data for CIR. The approach offers a scalable, domain-agnostic route to robust multimodal retrieval, with practical impact for e-commerce and search applications where labeled triplets are scarce.

Abstract

As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
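The filter-and-reorganize step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SyntheticPair`, `Triplet`, `build_triplets`, the `quality_score` callable (a stand-in for the MLLM-based filter), and the `threshold` value are all hypothetical, introduced only to show how generated image pairs might be turned into the standard CIR triplet form (reference image, modification text, target image).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SyntheticPair:
    """One generated image pair plus the text relating the two images (hypothetical)."""
    reference_image: str    # path or ID of the first generated image
    target_image: str       # path or ID of the second generated image
    modification_text: str  # text describing the change from reference to target


@dataclass(frozen=True)
class Triplet:
    """A CIR training sample: (reference image, modification text, target image)."""
    reference_image: str
    modification_text: str
    target_image: str


def build_triplets(
    pairs: Iterable[SyntheticPair],
    quality_score: Callable[[SyntheticPair], float],
    threshold: float = 0.5,  # assumed cut-off; the source does not state one
) -> List[Triplet]:
    """Keep pairs whose score reaches the threshold and reorganize them as triplets."""
    triplets = []
    for pair in pairs:
        # quality_score stands in for an MLLM judging the pair; any callable works here.
        if quality_score(pair) >= threshold:
            triplets.append(
                Triplet(pair.reference_image, pair.modification_text, pair.target_image)
            )
    return triplets
```

In practice the scorer would wrap a call to a multimodal model; in this sketch any function that maps a pair to a number can be passed in, which keeps the filtering logic testable without a model.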

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overall framework of our method. (a) The triplet synthesis pipeline consists of three stages. In the first stage, an LLM is guided to generate diverse textual quadruples. Subsequently, consistent image pairs are synthesized using the quadruples and reorganized into triplet form. The final stage leverages an MLLM to filter out low-quality samples. (b) The model architecture of CoAlign. The left side illustrates the encoding process of the query and target using different encoding modes, while the right shows the global and local optimization objectives employed by CoAlign.
  • Figure 2: Qualitative results on CIRR and FashionIQ. The reference image and target images are highlighted with green and red outlines, respectively.
  • Figure 3: Comparison of three paradigms for synthesizing image pairs. Compared to the other two approaches (b) and (c), our approach (a) is superior in both generation quality and consistency.
  • Figure 4: Hyperparameter and data scale analysis. Left: Sensitivity analysis of CoAlign on different hyperparameters. Right: Impact of data scale on zero-shot performance.