SDR-CIR: Semantic Debias Retrieval Framework for Training-Free Zero-Shot Composed Image Retrieval

Yi Sun, Jinyu Xu, Qing Xie, Jiachen Li, Yanchun Ma, Yongjian Liu

TL;DR

SDR-CIR tackles zero-shot Composed Image Retrieval by addressing visual noise in description generation and semantic bias in ranking. It introduces Selective CoT to focus visual extraction on modification-relevant content and a two-step Semantic Debias Ranking (Anchor–Debias) to reinforce useful cues while penalizing reference-induced bias. The approach is training-free and one-stage, delivering state-of-the-art performance among one-stage ZS-CIR methods across CIRCO, CIRR, and FashionIQ with favorable efficiency. The findings demonstrate robust handling of both redundant and omitted cues, improving real-world CIR retrieval accuracy and practicality.

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and modification text. Recent training-free zero-shot methods often employ Multimodal Large Language Models (MLLMs) with Chain-of-Thought (CoT) to compose a target image description for retrieval. However, due to the fuzzy matching nature of ZS-CIR, the generated description is prone to semantic bias relative to the target image. We propose SDR-CIR, a training-free Semantic Debias Ranking method based on CoT reasoning. First, Selective CoT guides the MLLM to extract visual content relevant to the modification text during image understanding, thereby reducing visual noise at the source. We then introduce a Semantic Debias Ranking with two steps, Anchor and Debias, to mitigate semantic bias. In the Anchor step, we fuse reference image features with target description features to reinforce useful semantics and supplement omitted cues. In the Debias step, we explicitly model the visual semantic contribution of the reference image to the description and incorporate it into the similarity score as a penalty term. By supplementing omitted cues while suppressing redundancy, SDR-CIR mitigates semantic bias and improves retrieval performance. Experiments on three standard CIR benchmarks show that SDR-CIR achieves state-of-the-art results among one-stage methods while maintaining high efficiency. The code is publicly available at https://github.com/suny105/SDR-CIR.
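The two ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the convex fusion weight `alpha`, the penalty weight `beta`, the use of cosine similarity, and the function and argument names are all assumptions made here for illustration. In particular, `contrib_feat` stands in for whatever feature the paper uses to model the reference image's visual semantic contribution to the description; how that feature is computed is not specified in this summary, so it is taken as an input.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def semantic_debias_scores(ref_feat, desc_feat, contrib_feat, cand_feats,
                           alpha=0.2, beta=0.1):
    """Hypothetical Anchor-then-Debias scoring over a candidate gallery.

    ref_feat:     (d,) feature of the reference image
    desc_feat:    (d,) feature of the generated target description
    contrib_feat: (d,) assumed feature of the reference image's visual
                  semantic contribution to the description
    cand_feats:   (n, d) features of the candidate images
    alpha, beta:  assumed fusion and penalty weights (illustrative defaults)
    """
    ref = l2_normalize(np.asarray(ref_feat, dtype=float))
    desc = l2_normalize(np.asarray(desc_feat, dtype=float))
    contrib = l2_normalize(np.asarray(contrib_feat, dtype=float))
    cands = l2_normalize(np.asarray(cand_feats, dtype=float))

    # Anchor (assumed form): fuse reference and description features
    # into a single composed query.
    query = l2_normalize(alpha * ref + (1.0 - alpha) * desc)
    anchor_sim = cands @ query

    # Debias (assumed form): candidates that resemble the reference-induced
    # content are penalized in the final score.
    penalty = cands @ contrib
    return anchor_sim - beta * penalty
```

Candidates would then be ranked by descending score, for example with `np.argsort(-scores)`. Setting `beta=0` in this sketch reduces it to ranking by the fused query alone, which makes the effect of the penalty term easy to isolate.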

Paper Structure (20 sections, 4 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison of an existing one-stage method with the proposed SDR-CIR.
  • Figure 2: Overview of the SDR-CIR framework. (1) The Selective CoT prompt instructs the MLLM to extract the visual content that is relevant to the modification, guided by the modification text. (2) Semantic Debias Ranking: the reference image features and the target description features are fused into a composed query that anchors the useful visual semantics; the similarity between the visual semantic contribution and each candidate image is then applied as a penalty term to debias the score.
  • Figure 3: Comparison of the CoT prompts used by OSrCIR and SDR-CIR.
  • Figure 4: Hyperparameter analysis of $\alpha$ and $\beta$ on the CIRCO test set, the CIRR test set, and the FashionIQ validation set. All experiments use ViT-L/14.
  • Figure 5: Robustness to redundant and missing information on CIRCO and FashionIQ. Top-2 retrieval results of SDR-CIR and a description-only baseline (Base Result) are compared. Red text marks redundant or missing information; green boxes indicate targets.
  • ...and 2 more figures