MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval

Xuri Ge, Chunhao Wang, Xindi Wang, Zheyun Qin, Zhumin Chen, Xin Xin

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.

Paper Structure (15 sections, 12 equations, 3 figures, 5 tables, 1 algorithm)

Figures (3)

  • Figure 1: Comparison of our approach (A) with the traditional CIR model (B) and the cross-modal attention-based CIR model (C). We introduce multimodal chain-of-thought reasoning to explicitly extract the relevant and irrelevant contents used to select reference visual information. This focuses the model on fine-grained useful references while largely avoiding the attention noise that visual-linguistic correlation introduces in method (C).
  • Figure 2: Illustration of our proposed MCoT-MVS. It primarily consists of the Multimodal Chain-of-Thought (CoT) Reasoning module, the Patch-level Visual Reference Selection (PVRS) module, the Instance-level Visual Reference Selection (IVRS) module, and the Weighted Hierarchical Combination (WHC) module. The multimodal CoT reasoning module first uses a pre-trained MLLM to decompose the composed query into retained and deleted contents and to infer the potential target context. The explicitly reasoned retained and deleted texts then guide the selection of useful patch-level and instance-level reference visual representations, preserving the user's intent and removing visual noise. Finally, the WHC fuses the selected representations with multiple target texts into an attention-aware query that is aligned with the target image.
  • Figure 3: Visualization of the attention weights learned by the PVRS and IVRS modules, guided by the explicit retain and delete modification intents inferred by the MLLM-based multimodal CoT reasoning, together with the weight distributions of the weighted hierarchical combination module (best viewed in color).
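
The selection-and-fusion idea described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the scoring rule (similarity to the retained-text embedding minus similarity to the removed-text embedding), the softmax pooling, and the names `select_reference` and `weighted_combine` are all assumptions chosen for illustration, and the toy 2-D vectors stand in for real patch, instance, and text embeddings.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_reference(features, retain_vec, remove_vec):
    """Attention-pool reference features (patches or instances).

    Hypothetical rule: a feature scores high when it is similar to the
    retained-text embedding and dissimilar to the removed-text embedding.
    """
    scores = [dot(f, retain_vec) - dot(f, remove_vec) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights


def weighted_combine(components, logits):
    """Fuse component vectors using softmax-normalized weights (assumed form)."""
    weights = softmax(logits)
    dim = len(components[0])
    return [sum(w * c[i] for w, c in zip(weights, components)) for i in range(dim)]


# Toy example: three reference "patches" in a 2-D embedding space.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
retain = [1.0, 0.0]   # stand-in embedding of the retained text
remove = [0.0, 1.0]   # stand-in embedding of the removed text

pooled, weights = select_reference(patches, retain, remove)

# Fuse two components with equal logits, which reduces to a plain average.
query = weighted_combine([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy setting the patch aligned with the retained text receives the largest weight and the patch aligned with the removed text the smallest, which mirrors the intent of suppressing visual content the user asked to change. In the paper's setting the fusion weights would be learned rather than fixed.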