CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval

Zelong Sun, Dong Jing, Zhiwu Lu

TL;DR

Zero-Shot Composed Image Retrieval (ZS-CIR) retrieves target images from a gallery using a composed query without annotated triplets, but prior approaches suffer from modality incompatibility, visual information loss, and shallow reasoning. The authors propose CoTMR, a training-free framework that uses a Large Vision-Language Model (LVLM) with CIRCoT for structured, step-by-step reasoning, augmented by multi-scale reasoning and a Multi-Grained Scoring (MGS) mechanism to fuse global and fine-grained cues for retrieval. CIRCoT pre-defines subtasks to guide LVLM reasoning and provides interpretable intermediate outputs, while multi-scale reasoning yields image-scale captions and object-scale existent/nonexistent objects to refine relevance. MGS integrates CLIP-based similarities from these outputs, rewarding relevant content and penalizing irrelevant content, leading to robust gains on FashionIQ, CIRR, and CIRCO across multiple CLIP backbones. Overall, CoTMR delivers strong, interpretable performance without training data and suggests promising directions for hybridization with pseudo-token methods or finer-grained detection cues.

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance the reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined subtasks. Considering that existing approaches focus solely on global-level reasoning, our CoTMR incorporates multi-scale reasoning to achieve more comprehensive inference via fine-grained predictions about the presence or absence of key elements at the object scale. Further, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing interpretability.
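The reward-penalty idea behind MGS can be illustrated with a small sketch. The exact scoring formula is not given in this summary, so the combination below (caption similarity, plus a weighted mean similarity to existent objects, minus a weighted mean similarity to nonexistent objects) is an assumption. The weights `lam` and `mu` borrow the names $\lambda$ and $\mu$ from the Figure 4 caption, but how they enter the score is a guess. The function names are hypothetical, and the toy 3-dimensional vectors stand in for CLIP image and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(image_vec, text_vecs):
    """Average similarity of one image to a list of text embeddings (0.0 if empty)."""
    if not text_vecs:
        return 0.0
    return sum(cosine(image_vec, t) for t in text_vecs) / len(text_vecs)


def multi_grained_score(image_vec, caption_vec, existent_vecs, nonexistent_vecs,
                        lam=0.5, mu=0.5):
    """Assumed reward-penalty score: global caption match, plus a reward for
    objects that should appear, minus a penalty for objects that should not."""
    global_sim = cosine(image_vec, caption_vec)
    reward = mean_similarity(image_vec, existent_vecs)
    penalty = mean_similarity(image_vec, nonexistent_vecs)
    return global_sim + lam * reward - mu * penalty


def rank_candidates(candidates, caption_vec, existent_vecs, nonexistent_vecs,
                    lam=0.5, mu=0.5):
    """Return (name, score) pairs sorted from best to worst."""
    scored = [
        (name, multi_grained_score(vec, caption_vec, existent_vecs,
                                   nonexistent_vecs, lam, mu))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (placeholders, not real CLIP outputs).
caption_vec = [1.0, 0.0, 0.0]        # image-scale target caption
existent_vecs = [[0.0, 1.0, 0.0]]    # object that should be present
nonexistent_vecs = [[0.0, 0.0, 1.0]] # object that should be absent
candidates = {
    "a": [1.0, 1.0, 0.0],  # matches the caption and contains the wanted object
    "b": [1.0, 0.0, 1.0],  # matches the caption but contains the unwanted object
}

ranking = rank_candidates(candidates, caption_vec, existent_vecs, nonexistent_vecs)
```

Both candidates match the caption equally well in this toy setup, so the ordering is decided entirely by the object-scale terms: candidate `a` is rewarded for the existent object, while candidate `b` is penalized for the nonexistent one.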

Paper Structure

This paper contains 26 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Flowcharts of existing ZS-CIR methods and our proposed CoTMR. Methods (a) and (b) face serious issues of visual information loss and insufficient reasoning. In contrast, our method (c) fully perceives image content, enhances reasoning process with CIRCoT, and augments multi-grained descriptions with multi-scale reasoning.
  • Figure 2: Overview architecture of CoTMR: (1) The LVLM equipped with CIRCoT, $P_{Img}$ and $P_{Obj}$, performs reasoning on the composed query at both image and object scales, to provide multi-grained outputs. (2) The Multi-Grained Scoring Mechanism combines the similarities of the three outputs with candidate images in the CLIP space through a reward-penalty calculation. IE and TE represent the image encoder and text encoder of CLIP, respectively.
  • Figure 3: Illustration of CIRCoT in image-scale reasoning ($P_{Img}$), which includes four predefined subtasks and allows LVLM to reason step-by-step within each subtask. CIRCoT in object-scale reasoning ($P_{Obj}$) follows a similar process (see the appendix for details).
  • Figure 4: Ablation study on the values of $\lambda$ and $\mu$ on the FashionIQ and CIRR validation sets. All experiments are performed with the ViT-B/32 CLIP model.
  • Figure 5: An example of a reasoning process with CIRCoT from the CIRR validation set. The LVLM focuses on a specific objective in each subtask within CIRCoT and gradually completes the overall reasoning goal.
  • ...and 5 more figures