CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning

Weihuang Lin, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

TL;DR

CIR-CoT tackles interpretable Composed Image Retrieval (CIR) with an end-to-end Multimodal LLM that generates explicit Chain-of-Thought reasoning before producing a target-image embedding. The approach augments the training data with three-stage CoT annotations (Caption, Reasoning, Conclusion) and trains the model in two stages to compress user intent into a dedicated embedding token, optimized with the joint objective $\mathcal{L} = \lambda_{txt} \mathcal{L}_{txt} + \lambda_{Info} \mathcal{L}_{InfoNCE}$, where $\mathcal{L}_{txt} = \mathrm{CE}(y_{txt}, \hat{y}_{txt})$ is the cross-entropy loss on the generated text. Experiments show competitive in-domain performance on FashionIQ and CIRR and strong cross-domain generalization on CIRCO, highlighting the benefit of explicit reasoning for fine-grained cross-modal alignment. The work advances trustworthy CIR by making the retrieval rationale transparent and points to a scalable path toward reasoning-aware multimodal retrieval systems.
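The joint objective in the TL;DR can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the in-batch, one-directional form of InfoNCE, the temperature value, the default weights, and all function names are assumptions made for illustration only.

```python
import math

def _log_sum_exp(scores):
    """Numerically stable log(sum(exp(s))) over a list of floats."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy (the L_txt term).

    logits: one list of vocabulary scores per generated token.
    target_ids: the ground-truth token index for each position.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        total += _log_sum_exp(scores) - scores[target]
    return total / len(target_ids)

def _l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(query_embs, target_embs, temperature=0.07):
    """In-batch InfoNCE (assumed form): query i should match target i,
    and every other target in the batch acts as a negative."""
    queries = [_l2_normalize(v) for v in query_embs]
    targets = [_l2_normalize(v) for v in target_embs]
    total = 0.0
    for i, q in enumerate(queries):
        sims = [sum(a * b for a, b in zip(q, t)) / temperature for t in targets]
        total += _log_sum_exp(sims) - sims[i]
    return total / len(queries)

def joint_loss(logits, target_ids, query_embs, target_embs,
               lambda_txt=1.0, lambda_info=1.0, temperature=0.07):
    """L = lambda_txt * L_txt + lambda_info * L_InfoNCE.
    The weights and temperature here are placeholders, not reported values."""
    l_txt = cross_entropy(logits, target_ids)
    l_info = info_nce(query_embs, target_embs, temperature)
    return lambda_txt * l_txt + lambda_info * l_info
```

In this sketch the query embeddings would come from the model's embedding token and the target embeddings from the candidate images; correctly paired batches yield a lower InfoNCE term than shuffled ones.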

Abstract

Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as "black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.
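To make the structured output concrete, the sketch below shows one way a three-stage annotation could be serialized into a training target that ends with the embedding token. Only the three stage names and the `<emb>` token come from the paper's description; the class name, the field labels, the line-based layout, and the example text are hypothetical.

```python
from dataclasses import dataclass

# Token name as it appears in the paper's Figure 3 caption.
EMB_TOKEN = "<emb>"

@dataclass
class CoTAnnotation:
    """One reasoning-augmented training example (hypothetical container)."""
    caption: str      # description of the reference image
    reasoning: str    # how the modification text changes the image
    conclusion: str   # description of the intended target image

    def to_target_text(self) -> str:
        # Assumed layout: one labelled line per stage, then the embedding
        # token whose hidden state would be used for retrieval.
        stages = [
            f"Caption: {self.caption.strip()}",
            f"Reasoning: {self.reasoning.strip()}",
            f"Conclusion: {self.conclusion.strip()}",
        ]
        return "\n".join(stages) + "\n" + EMB_TOKEN

# Made-up example, not taken from FashionIQ or CIRR.
example = CoTAnnotation(
    caption="A red sleeveless dress with a floral print.",
    reasoning="The text asks for long sleeves and a darker color.",
    conclusion="A dark long-sleeved dress with a floral print.",
)
target_text = example.to_target_text()
```

Placing the embedding token last reflects the order stated in the abstract: the reasoning chain is generated first, and the retrieval intent is encoded afterwards.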

Paper Structure

This paper contains 17 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of three retrieval approaches: (a) VLM-based method; (b) MLLM-based method (treating the MLLM as an encoder); (c) our CIR-CoT approach, enhanced with Chain-of-Thought reasoning for more accurate image retrieval.
  • Figure 2: The pipeline for constructing CoT training data. A multimodal query is processed through automated annotation to produce reasoning-augmented descriptions, followed by MLLM-based evaluation for quality control.
  • Figure 3: Overview of the proposed baseline CIR-CoT. The method leverages MLLMs to generate reasoning chains for the target image and obtain its embedding token <emb>, followed by contrastive learning to improve retrieval.
  • Figure 4: Qualitative results on the CIRR dataset.