Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification

Weizhen He; Yiheng Deng; Yunfeng Yan; Feng Zhu; Yizhou Wang; Lei Bai; Qingsong Xie; Donglian Qi; Wanli Ouyang; Shixiang Tang

TL;DR

This work introduces Instruct-ReID, a universal person re-identification task and framework that retrieves images based on an input image or language instruction, unifying Trad-ReID, CC-ReID, CTCC-ReID, VI-ReID, T2I-ReID, and LI-ReID as special cases. It presents the OmniReID++ benchmark, a large-scale, multimodal dataset collection with 13 training datasets (5,072,218 images, 333,825 identities) and two evaluation settings, task-specific and task-free, plus a novel $mAP_{\tau}$ metric for instruction-consistent retrieval. The authors propose IRM, featuring an Editing Transformer and an adaptive triplet loss, to handle diverse ReID tasks within a single framework, and IRM++, which uses memory bank-assisted learning to boost task-free performance by providing abundant negatives and soft/hard supervision. Across 10 test sets and 6 ReID tasks, IRM and IRM++ achieve state-of-the-art results, demonstrating cross-task generalization and retrieval guided by language and visual instructions.

Abstract

Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits their applications in the real world. This paper strives to resolve this problem by proposing a novel instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. Instruct-ReID is the first exploration of a general ReID setting, where six existing ReID tasks can be viewed as special cases by assigning different instructions. To facilitate research in this new instruct-ReID task, we propose a large-scale OmniReID++ benchmark equipped with diverse data and comprehensive evaluation methods, e.g., task-specific and task-free evaluation settings. In the task-specific evaluation setting, gallery sets are categorized according to specific ReID tasks. We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework. For the task-free evaluation setting, where target person images are retrieved from task-agnostic gallery sets, we further propose a new method called IRM++ with novel memory bank-assisted learning. Extensive evaluations of IRM and IRM++ on the OmniReID++ benchmark demonstrate the superiority of our proposed methods, achieving state-of-the-art performance on 10 test sets. The datasets, the model, and the code will be available at https://github.com/hwz-zju/Instruct-ReID

Paper Structure (25 sections, 14 equations, 12 figures, 8 tables)

Introduction
Related Work
Person Re-identification
Loss Function in ReID
Multimodal Retrieval
Memory Bank
OmniReID++ Benchmark
Instruct ReID Methodology
Instruction Generation
Model Architecture of IRM
Editing Transformer
Adaptive Triplet Loss
Overall Loss Function
Model Architecture of IRM++
Experiments
...and 10 more sections

Figures (12)

Figure 1: (a) We propose a new instruct-ReID task that unites various ReID tasks. Traditional ReID: The instruction may be “Do not change clothes”. Clothes-changing ReID: The instruction may be “Ignore clothes”. Clothes template based clothes-changing ReID: The instruction is a cropped clothes image and the model should retrieve the same person wearing the provided clothing. Language-instructed ReID: The instruction is several sentences describing pedestrian attributes. The model is required to retrieve the person described by the instruction. Visible-Infrared ReID: The instruction can be “Cross modality”. Text-to-image ReID: The model retrieves images according to the description sentence. (b) Our proposed method advances the performance limits of various person ReID tasks through a unified retrieval model. Specifically, on 10 datasets across the 6 ReID tasks, our method achieves a performance improvement of +0.3% mAP to +22.4% mAP compared to existing state-of-the-art methods.
Figure 2: The comparison between task-specific and task-free evaluation settings. The task-free evaluation setting offers a more uniform and flexible evaluation approach.
Figure 3: (a) We generate attributes for a person and then transform attributes into sentences by a large language model. (b) We crop upper clothes and search them online for clothes templates.
Figure 4: Illustration of the difference between the metrics mAP and $mAP_{\tau}$. In the rank list, light orange indicates positive samples and light blue indicates negative samples. In the $mAP_{\tau}$ example, the threshold is set to 0.5; the orange dashed box marks a sample whose identity matches the query but whose instruction similarity does not reach the threshold, so it is corrected to a negative sample.
Figure 5: The overall architecture of the proposed method. The instruction is fed into the instruction encoder to extract instruction features (a). The features are then propagated into the editing transformer (b) to capture instruction-edited features. We exploit adaptive triplet loss and identification loss to train the network. For the testing stage, we use the CLS token for retrieval.
...and 7 more figures
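
The Figure 4 caption describes how $mAP_{\tau}$ differs from plain mAP: a gallery sample whose identity matches the query is still counted as a negative if its instruction similarity falls below the threshold. The following is a minimal Python sketch of that idea for a single query, not the authors' implementation. The function names, the per-sample similarity scores, and the choice to drop demoted samples from the set of positives are all assumptions made for illustration; the paper's exact definition may differ.

```python
from typing import Sequence


def average_precision(relevant: Sequence[bool]) -> float:
    """Average precision over a ranked list of relevance flags (best rank first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / hits if hits else 0.0


def average_precision_tau(
    id_match: Sequence[bool],
    instruction_sim: Sequence[float],
    tau: float = 0.5,
) -> float:
    """Hypothetical AP_tau: a sample is positive only if the identity matches
    AND its instruction similarity reaches the threshold tau (assumption)."""
    relevant = [m and s >= tau for m, s in zip(id_match, instruction_sim)]
    return average_precision(relevant)


# Illustrative ranked list for one query (values are made up).
id_match = [True, False, True, True]
instruction_sim = [0.9, 0.8, 0.3, 0.7]

ap_plain = average_precision(id_match)
ap_tau = average_precision_tau(id_match, instruction_sim, tau=0.5)
```

In this made-up example the third sample matches the query identity but has similarity 0.3, so it is demoted to a negative under the threshold of 0.5 and the score drops relative to plain average precision. Averaging the per-query values over all queries would give the dataset-level metric.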
