Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Yuanmin Tang, Xiaoting Qin, Jue Zhang, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Ling, Saravan Rajmohan, Dongmei Zhang, Qi Wu

TL;DR

This work tackles training-free zero-shot CIR by introducing OSrCIR, a one-stage framework that uses Multimodal Large Language Models (MLLMs) to directly infer a target image description $T_t$ from a reference image $I_r$ and manipulation text $T_m$ via Reflective Chain-of-Thought (CoT) reasoning, thereby avoiding the information loss of a separate captioning stage. The target description is computed as $T_t = \Psi_M(p_c \circ I_r \circ T_m)$ and then matched against candidate images using CLIP-based cosine similarity, which keeps the reasoning efficient and interpretable in the language domain. OSrCIR improves on existing training-free methods by 1.80% to 6.44% across four CIR tasks on backbones such as ViT-L/14, setting a new state of the art for ZS-CIR while maintaining competitive inference speed (~0.6s per query). The approach also emphasizes interpretability through Reflective CoT and Vision-by-Language in-context learning, and the authors provide a complete prompt template and baseline code at the linked repository, facilitating adoption in vision-language applications.
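The retrieval step described above can be illustrated with a minimal sketch. It assumes the target description $T_t$ has already been produced by an MLLM and embedded by a CLIP-style text encoder, and that the candidate images have been embedded by the matching image encoder. The vectors below are hypothetical toy values, and the function names `cosine_similarity` and `retrieve` are illustrative; none of this is taken from the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(text_embedding, image_embeddings, top_k=1):
    """Return indices of the top_k candidates, most similar first."""
    scored = [
        (cosine_similarity(text_embedding, emb), idx)
        for idx, emb in enumerate(image_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Hypothetical stand-ins for encoder outputs (assumption: real CLIP
# embeddings have hundreds of dimensions, not three).
target_text_embedding = [0.9, 0.1, 0.0]
candidate_image_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.7],
]

ranking = retrieve(target_text_embedding, candidate_image_embeddings, top_k=3)
print(ranking)
```

In this toy setup the second candidate points in nearly the same direction as the text embedding, so it ranks first. Because cosine similarity ignores vector length, only the direction of each embedding affects the ranking.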

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at https://github.com/Pter61/osrcir2024/.

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of our motivation. (a) Two-stage implicit intention reasoning of the baseline CIReVL method. (b) Our one-stage approach OSrCIR with explicit intention reasoning.
  • Figure 2: An overview of our model. An MLLM processes the reference image and the manipulation text to generate a description of the desired target image by reflective CoT. To obtain the desired image, we use a vision-language model and perform text-to-image retrieval. Different colors denote the reasoning outcomes of each intention.
  • Figure 3: Results for object manipulation on CIRR.
  • Figure 4: Results for attribute manipulation on FashionIQ.
  • Figure 5: Visualization of Reflective CoT samples. We compare the top-1 retrieval results of our method and CIReVL. Different colors denote the reasoning outcomes of each intention. Our Reflective CoT effectively filters out elements irrelevant to user intention.
  • ...and 3 more figures