Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions

Weizhen He, Yiheng Deng, Shixiang Tang, Qihao Chen, Qingsong Xie, Yizhou Wang, Lei Bai, Feng Zhu, Rui Zhao, Wanli Ouyang, Donglian Qi, Yunfeng Yan

TL;DR

This work proposes Instruct-ReID, a unified framework for person re-identification guided by image or language instructions, and introduces OmniReID, a large-scale benchmark spanning 12 datasets across 6 ReID tasks to enable cross-task training and evaluation. The IRM architecture fuses instruction signals into query features via an Editing Transformer and an instruction-aware adaptive triplet loss, jointly optimizing identity and instruction alignment. Empirical results show consistent gains across Trad-, CC-, CTCC-, VI-, T2I-, and LI-ReID tasks, including a substantial +24.9% mAP improvement on language-instructed ReID, demonstrating strong cross-task generalization with a single model. The work enables easier deployment and extension to new ReID scenarios, with broad practical implications for cross-modal and language-guided person retrieval.

Abstract

Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits real-world applications. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to given image or language instructions. Instruct-ReID is a more general ReID setting, in which six existing ReID tasks can be viewed as special cases by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the proposed multi-purpose ReID model, trained on our OmniReID benchmark without fine-tuning, achieves gains of +0.5%, +0.6%, and +7.7% mAP on Market1501, MSMT17, and CUHK03 for traditional ReID; +6.4%, +7.1%, and +11.2% mAP on PRCC, VC-Clothes, and LTCC for clothes-changing ReID; +11.7% mAP on COCAS+ real2 for clothes template based clothes-changing ReID when using only RGB images; +24.9% mAP on COCAS+ real2 for our newly defined language-instructed ReID; +4.3% on LLCM for visible-infrared ReID; and +2.6% on CUHK-PEDES for text-to-image ReID. The datasets, the model, and code will be available at https://github.com/hwz-zju/Instruct-ReID.

Paper Structure (26 sections, 8 equations, 9 figures, 9 tables)

This paper contains 26 sections, 8 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (a) We propose a new instruct-ReID task that unites various ReID tasks. Traditional ReID: The instruction may be “Do not change clothes”. Clothes-changing ReID: The instruction may be “Ignore clothes”. Clothes template based clothes-changing ReID: The instruction is a cropped clothes image and the model should retrieve the same person wearing the provided clothing. Language-instructed ReID: The instruction is several sentences describing pedestrian attributes. The model is required to retrieve the person described by the instruction. Visible-Infrared ReID: The instruction can be “Cross modality”. Text-to-image ReID: The model retrieves images according to the description sentence. (b) Our proposed method improves the performance of various person ReID tasks with a unified retrieval model.
  • Figure 2: (a) We generate attributes for a person and then transform attributes into sentences by a large language model. (b) We crop upper clothes and search them online for clothes templates.
  • Figure 3: The overall architecture of the proposed method. The instruction is fed into the instruction encoder to extract instruction features (a). The features are then propagated into the editing transformer (b) to capture instruction-edited features. We exploit adaptive triplet loss and identification loss to train the network. For the testing stage, we use the CLS token for retrieval.
  • Figure 4: Illustration of adaptive triplet loss. Unlike the traditional triplet loss where the margin is fixed, the margin in our adaptive triplet loss is defined by the instruction similarity for the two query-instruction pairs that describe the same person. The features associated with similar instructions are pulled to be closer.
  • Figure 5: Illustration of retrieval results for all tasks. We visualize the task-specific instructions on three people as examples. Green and red boxes mark true and false matches, respectively.
  • ...and 4 more figures
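
The idea in the Figure 4 caption, a triplet margin that depends on instruction similarity rather than being fixed, can be sketched in a few lines. This is only an illustrative sketch, not the paper's formulation: the helper names, the `base_margin` value of 0.3, and the choice to scale the margin linearly with cosine similarity rescaled to [0, 1] are all assumptions made here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def adaptive_margin(instr_anchor, instr_positive, base_margin=0.3):
    # ASSUMPTION (not from the paper): map cosine similarity from [-1, 1]
    # to [0, 1] and scale a hypothetical base margin by it, so that more
    # similar instructions yield a larger margin.
    sim = (cosine_similarity(instr_anchor, instr_positive) + 1.0) / 2.0
    return base_margin * sim


def adaptive_triplet_loss(anchor, positive, negative,
                          instr_anchor, instr_positive, base_margin=0.3):
    # Standard hinge-style triplet loss, except that the margin comes from
    # the instruction similarity of the anchor/positive pair, which is
    # assumed to describe the same person.
    margin = adaptive_margin(instr_anchor, instr_positive, base_margin)
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


if __name__ == "__main__":
    anchor, positive, negative = [0.0, 0.0], [1.0, 0.0], [1.2, 0.0]
    same_instr = adaptive_triplet_loss(
        anchor, positive, negative, [1.0, 0.0], [1.0, 0.0])
    diff_instr = adaptive_triplet_loss(
        anchor, positive, negative, [1.0, 0.0], [0.0, 1.0])
    print(same_instr, diff_instr)
```

Under these assumptions, a same-identity pair with identical instructions is penalized until the positive is closer to the anchor than the negative by the full base margin, while a pair with dissimilar instructions faces a smaller margin. This is one way to read "features associated with similar instructions are pulled to be closer".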