CoLLM: A Large Language Model for Composed Image Retrieval

Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava

TL;DR

CoLLM tackles the data bottleneck in composed image retrieval by generating CIR triplets on the fly from image-caption pairs and leveraging large language models to produce joint reference-text embeddings. It introduces the Multi-Text CIR dataset (MTCIR) with 3.4 million image pairs and 17.7 million modification texts, plus refined CIRR and Fashion-IQ benchmarks to reduce ambiguity. Through pre-training on image-caption data and targeted fine-tuning on MTCIR, CoLLM achieves state-of-the-art results on multiple CIR benchmarks and demonstrates competitive gains when the synthetic data is used to train other models. The refined benchmarks and surrounding ablations provide robust evaluation and insights into how to best fuse vision and language for nuanced query understanding in CIR.

Abstract

Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
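The retrieval setup described in the abstract can be illustrated with a small sketch: a joint embedding of (reference image, modification text) is compared against the embeddings of candidate images, and quality is reported as Recall@K. The code below is not the authors' implementation. It assumes the query and gallery embeddings have already been computed by some model, and the names `recall_at_k`, `query_emb`, `gallery_emb`, and `target_idx` are hypothetical, chosen only for illustration.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, target_idx, ks=(1, 10, 50)):
    """Recall@K for composed queries against an image gallery.

    query_emb:   (num_queries, dim) joint embeddings of reference image + text
                 (assumed to be precomputed; how they are produced is out of scope).
    gallery_emb: (num_gallery, dim) embeddings of candidate target images.
    target_idx:  (num_queries,) index of the correct gallery image per query.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # shape: (num_queries, num_gallery)

    # Rank of the target = number of gallery items scored strictly higher.
    target_sims = sims[np.arange(len(q)), target_idx][:, None]
    ranks = (sims > target_sims).sum(axis=1)

    # A query counts as a hit at K if its target is among the top K items.
    return {k: float((ranks < k).mean()) for k in ks}

# Toy example: a 4-image gallery with one-hot embeddings.
gallery = np.eye(4)
queries = np.array([[1.0, 0.0, 0.0, 0.0],   # matches gallery image 0
                    [0.0, 1.0, 0.0, 0.0]])  # matches gallery image 1
targets = np.array([0, 0])                  # second query's target is ranked second
recalls = recall_at_k(queries, gallery, targets, ks=(1, 2))
recall_sum = sum(recalls.values())          # aggregate over the chosen K values
```

Summing the per-K recalls gives a single aggregate number in the spirit of the "Recall Sum" reported in Figure 1. Benchmark protocols may add further rules, such as excluding the reference image from the candidates, which this sketch does not handle.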

Paper Structure

This paper contains 26 sections, 4 equations, 15 figures, 24 tables.

Figures (15)

  • Figure 1: (a) An example of CIR. (b) Recall Sum at {1,10,50} for CIRR and {10, 50} for Fashion-IQ between CoLLM and state-of-the-art (SoTA) models under zero-shot settings. We evaluate two training scenarios: (i) without triplet data and (ii) with synthetic triplet data.
  • Figure 2: An overview of our model and training strategies when using (a) image-caption pairs and (b) CIR triplets.
  • Figure 3: An overview of reference image embedding synthesis and modification text synthesis. The red-framed image represents the nearest neighbor of the augmented image in the training batch.
  • Figure 4: An example from the MTCIR dataset. Each sample contains multiple short texts describing different modifications.
  • Figure 5: Statistical analysis of the refinement process for CIRR and FashionIQ (FIQ) datasets.
  • ...and 10 more figures