A Comprehensive Survey on Composed Image Retrieval
Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
TL;DR
This survey addresses Composed Image Retrieval (CIR), where a target image is retrieved using a multimodal query consisting of a reference image and a modification text. It systematically categorizes supervised and zero-shot CIR methods, detailing four core components (feature extraction, image-text fusion, target matching, data augmentation) and three zero-shot families (textual inversion, pseudo-triplets, training-free), while also covering related tasks and benchmarks. The authors highlight the dominance of vision-language pre-trained encoders in achieving strong performance, the effectiveness of diverse fusion and matching strategies, and the value of data augmentation and reranking in pushing accuracy. They also discuss the limitations of current datasets and the potential of large-language-model–driven approaches to advance CIR, offering concrete directions for scalable benchmarks, robust learning under noise, and efficient retrieval pipelines.
Abstract
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR that provides a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration. The curated collection of related works is maintained and continuously updated at https://github.com/haokunwen/Awesome-Composed-Image-Retrieval.
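To make the task definition concrete, the following is a minimal sketch of the query-composition and target-matching steps, not a method taken from the survey. It assumes that features for the reference image, the modification text, and the candidate images have already been extracted into a shared embedding space (for example, by a vision-language pre-trained encoder), and it uses a simple weighted sum as the fusion step purely for illustration. The function names `compose_query` and `rank_candidates`, the `text_weight` parameter, and the toy vectors are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def compose_query(ref_image_feat, mod_text_feat, text_weight=0.5):
    """Fuse a reference-image feature and a modification-text feature.

    Assumption: a weighted sum of unit-normalized features stands in for
    the image-text fusion component; real CIR models use learned fusion.
    """
    fused = ((1.0 - text_weight) * l2_normalize(ref_image_feat)
             + text_weight * l2_normalize(mod_text_feat))
    return l2_normalize(fused)


def rank_candidates(query_feat, candidate_feats):
    """Rank candidate images by cosine similarity to the composed query."""
    scores = l2_normalize(candidate_feats) @ query_feat
    order = np.argsort(-scores)  # indices sorted from best to worst match
    return order, scores


# Toy 3-dimensional features (illustrative only, not real encoder outputs).
ref_image_feat = np.array([1.0, 0.0, 0.0])
mod_text_feat = np.array([0.0, 1.0, 0.0])
candidate_feats = np.array([
    [1.0, 0.0, 0.0],   # identical to the reference image
    [0.7, 0.7, 0.0],   # reference content plus the requested change
    [0.0, 0.0, 1.0],   # unrelated image
])

query_feat = compose_query(ref_image_feat, mod_text_feat)
order, scores = rank_candidates(query_feat, candidate_feats)
print(order)
```

In this toy setup the candidate that combines the reference content with the requested modification ranks first, ahead of the unmodified reference, which is the behavior a CIR system is meant to produce.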
