ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs

TL;DR

ConText-CIR targets composed image retrieval with complex text modifications. Within a cross-attention fusion model, a Text Concept-Consistency loss grounds each noun phrase of the modification text in the relevant image regions. A synthetic data generation pipeline, good4cir, produces richer, multi-attribute CIR training data from existing datasets and unlabeled images. The approach reports state-of-the-art performance on CIRR and CIRCO in both supervised and zero-shot settings, indicating that targeted concept grounding and more diverse training data improve retrieval accuracy. The work also provides practical tools for data augmentation and grounding.

Abstract

Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at https://github.com/mvrl/ConText-CIR.
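
As a rough illustration of the retrieval setup described in the abstract, the Python sketch below fuses text-token and image-patch features with one single-head cross-attention step, then ranks a gallery by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fuse` and `rank_gallery`, the residual connection with mean pooling, the cosine-similarity ranking, and the random arrays standing in for encoder outputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(text_tokens, image_patches):
    """Single-head cross-attention: text tokens query the image patches.

    Assumption: a residual connection followed by mean pooling gives one
    composed query vector. The paper's actual fusion may differ.
    """
    d = image_patches.shape[-1]
    weights = softmax(text_tokens @ image_patches.T / np.sqrt(d), axis=-1)
    attended = weights @ image_patches          # (n_tokens, d)
    return (text_tokens + attended).mean(axis=0)

def rank_gallery(query_vec, gallery):
    """Return gallery indices sorted from most to least similar (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

# Random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))   # query-image patch features
tokens = rng.normal(size=(8, 16))     # modification-text token features
query = fuse(tokens, patches)

gallery = rng.normal(size=(5, 16))    # candidate target-image embeddings
gallery[3] = 2.0 * query              # plant a candidate aligned with the query
ranking = rank_gallery(query, gallery)
```

In this toy setup the planted candidate has cosine similarity 1 with the composed query, so it is ranked first; with real encoders the gallery embeddings would come from the image encoder instead.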

Paper Structure

This paper contains 23 sections, 7 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: A failure case of current composed image retrieval methods. Previous methods do not accurately capture the two conditions specified by the text.
  • Figure 2: The overall architecture of our approach, ConText-CIR. The framework guides attention to the related image regions by penalizing large differences between attention maps resulting from concept-specific and whole-text representations for each noun phrase. During inference, our method operates efficiently using a simple cross-attention mechanism to combine image and text features. The right side of the figure shows that the cross-attention between the concept "the hand" from the representation of the entire text and the image query converges to the local region around the hand with very little spurious attention.
  • Figure 3: Qualitative issues with existing CIR datasets.
  • Figure 4: Examples of original captions and rewritten CIRR captions.
  • Figure 5: Noun-phrase-level cross attention maps for models trained with and without the Text Concept-Consistency loss.
  • ...and 3 more figures
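
The Figure 2 caption describes penalizing large differences between the attention maps produced by the concept-specific and whole-text representations of each noun phrase. The Python sketch below shows one way such a penalty could be computed for a single noun phrase. It is a sketch under assumptions, not the paper's definition of the Text Concept-Consistency loss: the squared L2 distance, the mean pooling over phrase tokens, the single-head attention, and the names `attention_map` and `concept_consistency_penalty` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(phrase_tokens, image_patches):
    """Cross-attention of phrase tokens over image patches.

    Returns one distribution over patches, averaged across the phrase tokens.
    """
    d = image_patches.shape[-1]
    scores = phrase_tokens @ image_patches.T / np.sqrt(d)
    return softmax(scores, axis=-1).mean(axis=0)

def concept_consistency_penalty(whole_text_tokens, phrase_slice,
                                phrase_alone_tokens, image_patches):
    """Penalty between two attention maps for the same noun phrase.

    in_context: phrase tokens taken from the encoding of the whole text.
    alone:      phrase tokens from encoding the noun phrase by itself.
    Assumption: squared L2 distance; the paper may use another measure.
    """
    in_context = attention_map(whole_text_tokens[phrase_slice], image_patches)
    alone = attention_map(phrase_alone_tokens, image_patches)
    return float(np.sum((in_context - alone) ** 2))

# Random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))      # query-image patch features
text = rng.normal(size=(8, 16))          # whole-text token features
phrase = slice(2, 4)                     # token span of one noun phrase
alone = rng.normal(size=(2, 16))         # the same phrase encoded on its own

penalty = concept_consistency_penalty(text, phrase, alone, patches)
same = concept_consistency_penalty(text, phrase, text[phrase], patches)
```

The penalty is zero when the two representations yield identical attention maps (`same` above) and grows as they attend to different patches, which matches the behavior the caption describes.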