FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

Zixu Li, Zhiheng Fu, Yupeng Hu, Zhiwei Chen, Haokun Wen, Liqiang Nie

TL;DR

This work addresses the limitations of coarse-grained modification text (CoarseMT) in CIR by introducing a fine-grained CIR paradigm, a robust data annotation pipeline that produces fine-grained modification text (FineMT), and two fine-grained benchmarks (Fine-FashionIQ and Fine-CIRR). It then presents FineCIR, an explicit parsing framework that uses scene graphs and an entity-guided composition mechanism to align fine-grained modification semantics with visual entities, improving retrieval precision. Extensive experiments show that FineCIR outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR tasks, validating its effectiveness and generalization. The authors release their code and datasets to advance research in fine-grained multimodal retrieval.

Abstract

Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.

Paper Structure

This paper contains 30 sections, 6 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Problems caused by CoarseMT in (a) open-domain CIR and (b) fashion-domain CIR; (c) illustrates our FineMT in CIR scenarios.
  • Figure 2: Our fine-grained CIR data annotation pipeline.
  • Figure 3: Overall architecture of our proposed FineCIR.
  • Figure 4: Qualitative examples of our proposed FineCIR compared to the second-best CIR model, SPRC.
  • Figure 5: Capability of fine-grained modification text (FineMT) to reduce imprecise positive samples.
  • ...and 10 more figures