Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification
Weizhen He, Yiheng Deng, Yunfeng Yan, Feng Zhu, Yizhou Wang, Lei Bai, Qingsong Xie, Donglian Qi, Wanli Ouyang, Shixiang Tang
TL;DR
This work introduces Instruct-ReID, a universal person re-identification (ReID) framework that retrieves images based on an input image or language instruction, unifying Trad-ReID, CC-ReID, CTCC-ReID, VI-ReID, T2I-ReID, and LI-ReID as special cases. It presents the OmniReID++ benchmark, a large-scale, multimodal dataset collection with 13 training datasets (5,072,218 images, 333,825 identities) and two evaluation settings to assess task-specific and task-free generalization, plus a novel $mAP_{\tau}$ metric for instruction-consistent retrieval. The authors propose IRM, featuring an Editing Transformer and an adaptive triplet loss, to handle diverse ReID tasks within a single framework, and IRM++, which uses memory bank-assisted learning to boost task-free performance by providing abundant negatives and soft/hard supervision. Across 10 test sets and 6 ReID tasks, IRM/IRM++ achieve state-of-the-art results, demonstrating strong cross-task generalization and practical retrieval capabilities guided by language and visual instructions. The work offers a foundation for unified, instruction-guided ReID systems with broad real-world applicability and highlights future directions in model design and evaluation for multimodal identity retrieval.
Abstract
Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits real-world applications. This paper strives to resolve this problem by proposing a novel instruct-ReID task that requires the model to retrieve images according to given image or language instructions. Instruct-ReID is the first exploration of a general ReID setting, where six existing ReID tasks can be viewed as special cases by assigning different instructions. To facilitate research on this new instruct-ReID task, we propose a large-scale OmniReID++ benchmark equipped with diverse data and comprehensive evaluation methods, e.g., task-specific and task-free evaluation settings. In the task-specific evaluation setting, gallery sets are categorized according to specific ReID tasks. We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework. For the task-free evaluation setting, where target person images are retrieved from task-agnostic gallery sets, we further propose a new method called IRM++ with novel memory bank-assisted learning. Extensive evaluations of IRM and IRM++ on the OmniReID++ benchmark demonstrate the superiority of our proposed methods, achieving state-of-the-art performance on 10 test sets. The datasets, the model, and the code will be available at https://github.com/hwz-zju/Instruct-ReID
