SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval

Yuqi Xiao, Yingying Zhu

TL;DR

Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without triplet annotations. The authors propose SETR, a two-stage framework that first performs intersection-driven coarse retrieval to prune distractors and then uses a LoRA-adapted multimodal LLM (MLLM) to make binary semantic relevance judgments for fine-grained re-ranking. This design targets two weaknesses of CLIP-based methods: interference from union-based feature fusion and a lack of fine-grained discrimination. Experiments on CIRR, CIRCO, and Fashion-IQ show state-of-the-art performance, with Recall@1 on CIRR improving by up to 15.15 points, supporting two-stage reasoning as a paradigm for robust, portable ZS-CIR.

Abstract

Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary semantic relevance judgments ("Yes/No"), which goes beyond CLIP's global feature matching by explicitly verifying relational and attribute-level consistency. Together, these two stages form a complementary pipeline: coarse retrieval narrows the candidate pool with high recall, while re-ranking ensures precise alignment with nuanced textual modifications. Experiments on CIRR, Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points. Our results establish two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
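The retrieve-then-re-rank flow described in the abstract can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the embeddings are toy 2-D vectors standing in for CLIP features, the intersection-driven query construction is not modeled, and `judge` is a hypothetical callable standing in for the LoRA-adapted MLLM, assumed to return a pair of ("Yes", "No") logits per candidate. The function names `coarse_retrieve`, `yes_probability`, and `rerank` are likewise invented for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_retrieve(query: Sequence[float],
                    gallery: Sequence[Sequence[float]],
                    k: int) -> List[int]:
    """Stage 1: indices of the k gallery embeddings most similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the (assumed) 'Yes' and 'No' logits."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


def rerank(candidates: Sequence[int],
           judge: Callable[[int], Tuple[float, float]]) -> List[int]:
    """Stage 2: reorder candidates by the judge's probability of 'Yes'."""
    scored = [(yes_probability(*judge(i)), i) for i in candidates]
    # sorted() is stable, so ties keep their coarse-stage order.
    return [i for _, i in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    # Toy gallery and query; real systems would use CLIP embeddings.
    gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
    top = coarse_retrieve([1.0, 0.2], gallery, k=3)
    print(top)  # [1, 0, 3]

    # Hypothetical (yes, no) logits that an MLLM judge might emit.
    logits = {0: (2.0, -1.0), 1: (-0.5, 1.0), 3: (0.5, 0.5)}
    print(rerank(top, lambda i: logits[i]))  # [0, 3, 1]
```

In the toy run, the coarse stage ranks image 1 first on embedding similarity alone, while the binary judgment promotes image 0. This mirrors the division of labor the abstract describes: the first stage narrows the pool cheaply, and the second stage decides the final order among the survivors.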

Paper Structure

This paper contains 21 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of union-based vs. intersection-driven strategies. We show attention heatmaps on the same hard negative image retrieved by the union-based strategy (red border). Top: the union-based strategy retains the irrelevant “in the snow” cue and ranks the wrong image first (red border), with its similarity attention focused on the snowy background. Bottom: our intersection-driven strategy correctly retrieves the target image (green border) with a clean pseudo-target description, and its similarity attention highlights the precise clue “hugging.”
  • Figure 2: Overall framework of the proposed SETR for ZS-CIR.
  • Figure 3: Examples from CIRR showing SETR’s improvement over union-based baselines. Our intersection-driven retrieval eliminates background noise, and re-ranking resolves compositional phrasing to recover the correct target.