WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang

TL;DR

WISER is a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness, and significantly outperforms previous methods across multiple benchmarks.

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals and dynamically fusing the two paths for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
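
The control flow of the "retrieve-verify-refine" pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function and parameter name (`wiser_loop`, `edit`, `retrieve_t2i`, `retrieve_i2i`, `verify`, `refine`, `top_k`, `tau`, `max_rounds`) is a hypothetical placeholder, the editor, retrievers, verifier, and refiner are passed in as plain callables, and the fusion step is simplified to ranking the pooled candidates by verifier confidence because the abstract does not specify the fusion rule.

```python
from typing import Callable, List, Optional, Tuple

Candidate = str  # hypothetical: an image identifier


def wiser_loop(
    edit: Callable[[Optional[str]], Tuple[str, str]],
    retrieve_t2i: Callable[[str, int], List[Candidate]],
    retrieve_i2i: Callable[[str, int], List[Candidate]],
    verify: Callable[[Candidate], float],
    refine: Callable[[List[Candidate]], str],
    top_k: int = 5,
    tau: float = 0.5,
    max_rounds: int = 2,
) -> List[Candidate]:
    """Sketch of a retrieve-verify-refine loop (all names are assumptions)."""
    suggestion: Optional[str] = None
    ranked: List[Candidate] = []
    # One initial retrieval plus at most `max_rounds` refinement rounds.
    for _ in range(max_rounds + 1):
        # The editor produces an edited caption and an edited image,
        # optionally guided by the previous round's suggestion.
        caption, edited_image = edit(suggestion)

        # Wider Search: retrieve along both paths and pool the results,
        # dropping duplicates while preserving first-seen order.
        pool = list(dict.fromkeys(
            retrieve_t2i(caption, top_k) + retrieve_i2i(edited_image, top_k)
        ))

        # Verification: score each pooled candidate with a confidence value.
        scores = {c: verify(c) for c in pool}

        # Simplified fusion (assumption): rank by verifier confidence.
        ranked = sorted(pool, key=lambda c: scores[c], reverse=True)

        # Accept the ranking once the best candidate is confident enough.
        if ranked and scores[ranked[0]] >= tau:
            break

        # Deeper Thinking: ask the refiner for a suggestion and retry.
        suggestion = refine(ranked)
    return ranked
```

Passing the components as callables keeps the sketch independent of any particular captioning, image-editing, or retrieval model; the loop only fixes the order of the three stages and the stopping conditions (a confidence threshold or an iteration limit).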

Paper Structure (21 sections, 9 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: Comparison of existing ZS-CIR methods. (a) T2I may fail to preserve visual details from the reference image, while (b) I2I often struggles with complex modifications. In contrast, (c) WISER successfully adapts to diverse modification intents through a "retrieve–verify–refine" pipeline.
  • Figure 2: Overview of the proposed WISER framework. (1) Wider Search. We leverage an editor to produce text and image queries for dual-path retrieval, aggregating the top-$K$ results into a unified candidate pool. (2) Adaptive Fusion. We employ a verifier to assess the candidates with confidence scores, applying a multi-level fusion strategy for high-confidence results and triggering refinement for low-confidence ones. (3) Deeper Thinking. For uncertain retrievals, we leverage a refiner to analyze unmet modifications and then feed targeted suggestions back to the editor, iterating until a predefined limit is reached.
  • Figure 3: Qualitative results on (a) Fashion-IQ, (b) CIRR, and (c) CIRCO datasets. Red marks incorrect retrievals, green marks correct ones, and the gray arrow points to refined results.
  • Figure 4: Sensitivity analysis on confidence threshold $\tau$ and refinement iteration $N$ on CIRCO.
  • Figure 5: Comparison between fixed fusion strategies and WISER on CIRCO. $\lambda$ controls the T2I weight in the fixed fusion ($\lambda$ for T2I and $1-\lambda$ for I2I). Our WISER method achieves superior performance over all $\lambda$ values, highlighting the limitation of static weighting.
  • ...and 7 more figures
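
The fixed-fusion baseline in the Figure 5 caption weights the T2I path by $\lambda$ and the I2I path by $1-\lambda$. A minimal sketch of such a static weighting is given below, assuming that the quantities being combined are per-candidate similarity scores and that a candidate missing from one path contributes a score of 0 on that path; neither detail is stated in the caption, and the function name `fixed_fusion` is hypothetical.

```python
from typing import Dict, List


def fixed_fusion(
    t2i_scores: Dict[str, float],
    i2i_scores: Dict[str, float],
    lam: float,
) -> List[str]:
    """Rank candidates by lam * s_T2I + (1 - lam) * s_I2I (illustrative only)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    candidates = set(t2i_scores) | set(i2i_scores)
    # Assumption: a candidate absent from one path scores 0.0 on that path.
    fused = {
        c: lam * t2i_scores.get(c, 0.0) + (1.0 - lam) * i2i_scores.get(c, 0.0)
        for c in candidates
    }
    # Sort by descending fused score; break ties by identifier for determinism.
    return sorted(fused, key=lambda c: (-fused[c], c))
```

With `lam = 1.0` the ranking reduces to pure T2I and with `lam = 0.0` to pure I2I. Because a single `lam` applies to every query, this kind of baseline cannot adapt to queries that favor one path over the other, which is the limitation of static weighting the caption points to.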