SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval
Yuqi Xiao, Yingying Zhu
TL;DR
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without triplet annotations. The authors propose SETR, a two-stage framework that first performs intersection-driven coarse retrieval to prune distractors and then uses a LoRA-adapted multimodal LLM (MLLM) to make binary semantic relevance judgments for fine-grained re-ranking. This design targets two weaknesses of CLIP-based methods: interference from union-based feature fusion and a lack of fine-grained discrimination. Experiments on CIRR, CIRCO, and Fashion-IQ show state-of-the-art performance, with Recall@1 on CIRR improving by up to 15.15 points, supporting two-stage reasoning as a paradigm for robust, portable ZS-CIR.
Abstract
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation (LoRA) to conduct binary semantic relevance judgments ("Yes/No"); this goes beyond CLIP's global feature matching by explicitly verifying relational and attribute-level consistency. Together, these two stages form a complementary pipeline: coarse retrieval narrows the candidate pool with high recall, while re-ranking ensures precise alignment with nuanced textual modifications. Experiments on CIRR, Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points. Our results establish two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
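The coarse-then-re-rank control flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the query is taken as an already-computed embedding (the intersection-driven construction is not reproduced here), the `judge` callable is a hypothetical stand-in for the LoRA-adapted MLLM, and scoring candidates by a two-way softmax over "Yes" and "No" logits is an assumption, since the abstract only states that the judgments are binary.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def coarse_retrieve(query: np.ndarray, gallery: np.ndarray, k: int) -> np.ndarray:
    """Stage 1: indices of the k gallery items most cosine-similar to the query."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-sims)[:k]


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the "Yes" and "No" answer logits (assumed scoring)."""
    return float(1.0 / (1.0 + np.exp(no_logit - yes_logit)))


def rerank(candidates, judge):
    """Stage 2: reorder candidates by the judge's probability of "Yes".

    `judge` is a hypothetical callable mapping a gallery index to a
    (yes_logit, no_logit) pair; ties keep their coarse-stage order.
    """
    scores = [yes_probability(*judge(int(i))) for i in candidates]
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [int(candidates[j]) for j in order]


# Toy data: four 2-D "image embeddings" and one query embedding.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.2])


def toy_judge(idx: int):
    # Pretend the judge answers "Yes" only for gallery item 3.
    return (2.0, 0.0) if idx == 3 else (0.0, 1.0)


candidates = coarse_retrieve(query, gallery, k=3)
ranking = rerank(candidates, toy_judge)
```

In this toy run, item 2 is pruned by the coarse stage and never reaches the judge, while item 3 moves from last to first among the survivors. The expensive relevance check is therefore applied only to the top-k candidates, which is the cost argument behind a two-stage design.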
