A Versatile Framework for Multi-scene Person Re-identification

Wei-Shi Zheng, Junkai Yan, Yi-Xing Peng

TL;DR

VersReID addresses the need for a single model capable of multi-scene person ReID by introducing a two-stage, prompt-based twin framework. It first builds a ReID Bank with scene-specific prompts to capture diverse scene knowledge, then distills this knowledge into a V-Branch with versatile prompts that operates without scene labels at inference. A self-supervised pretraining strategy, MPDA, injects multi-scene priors to improve generalization across general, low-resolution, clothing-change, occlusion, and cross-modality scenes. Empirical results across seven downstream datasets and a joint testing set show that VersReID achieves strong performance and robustness, outperforming many multi-scene baselines and approaching or surpassing some single-scene methods, with further gains from the VersReID* variant via overlapping patch embeddings. The work demonstrates a practical pathway to versatile, scalable ReID in realistic, multi-scene deployments and highlights the value of prompt-based knowledge distillation and self-supervised data augmentation.

Abstract

Person Re-identification (ReID) has been studied extensively for a decade, with the goal of associating images of the same person across non-overlapping camera views. To overcome the significant variations between images across camera views, numerous ReID variants have been developed to address specific challenges such as resolution change, clothing change, occlusion, and modality change. Despite their impressive performance, these variants typically work in isolation and cannot be applied to other challenges. To the best of our knowledge, there is no versatile ReID model that can handle various ReID challenges at the same time. This work is the first attempt at learning a versatile ReID model to solve this problem. Our main idea is a two-stage prompt-based twin modeling framework called VersReID. VersReID first leverages scene labels to train a ReID Bank that contains abundant knowledge for handling various scenes, where several groups of scene-specific prompts encode different scene-specific knowledge. In the second stage, we distill a V-Branch model with versatile prompts from the ReID Bank to adaptively solve ReID in different scenes, eliminating the need for scene labels at inference. To facilitate training VersReID, we further introduce multi-scene properties into self-supervised learning of ReID via a multi-scene prioris data augmentation (MPDA) strategy. Through extensive experiments, we demonstrate the success of learning an effective and versatile ReID model that handles ReID tasks under multi-scene conditions, including general, low-resolution, clothing-change, occlusion, and cross-modality scenes, without manual assignment of scene labels at inference. Code and models are available at https://github.com/iSEE-Laboratory/VersReID.
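
The difference between the two stages can be illustrated with a minimal, dependency-free sketch of how the input token sequences might be assembled. It is not the authors' implementation: the helper names, the token order (class token, then prompts, then image tokens), the embedding size, and the use of two prompts per scene group are all assumptions made for illustration. Only the five scene names and the default of five versatile prompts come from the paper summary.

```python
import random

# Scene names as listed in the abstract.
SCENES = ["general", "low_resolution", "clothing_change", "occlusion", "cross_modality"]


def make_prompts(num_prompts, dim, rng):
    """Create `num_prompts` prompt vectors (randomly initialised stand-ins
    for learnable parameters)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]


def build_bank_input(cls_token, image_tokens, scene_prompts, scene_label):
    """Stage 1 (ReID Bank): the scene label selects which prompt group is
    concatenated with the image tokens. Token order is an assumption."""
    return [cls_token] + scene_prompts[scene_label] + image_tokens


def build_vbranch_input(cls_token, image_tokens, versatile_prompts):
    """Stage 2 (V-Branch): a single shared prompt group is used for every
    image, so no scene label is required."""
    return [cls_token] + versatile_prompts + image_tokens


rng = random.Random(0)
dim, num_image_tokens = 8, 4  # hypothetical sizes, chosen only for the demo

# Two prompts per scene group is a hypothetical choice.
scene_prompts = {scene: make_prompts(2, dim, rng) for scene in SCENES}
# Five versatile prompts is the default reported for the V-Branch.
versatile_prompts = make_prompts(5, dim, rng)

cls_token = [0.0] * dim
image_tokens = [[rng.random() for _ in range(dim)] for _ in range(num_image_tokens)]

bank_seq = build_bank_input(cls_token, image_tokens, scene_prompts, "occlusion")
vbranch_seq = build_vbranch_input(cls_token, image_tokens, versatile_prompts)
```

In this sketch the only interface difference is the `scene_label` argument: the ReID Bank cannot build its input without it, while the V-Branch can, which is what makes the second stage usable at inference time.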

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (a) Traditional frameworks train a distinct model for each scene and usually require auxiliary information such as contours and key points. (b) In contrast, our versatile ReID framework (VersReID) can handle multiple ReID scenes simultaneously without auxiliary information. The colored circles in the figure indicate prompts.
  • Figure 2: An overview of the VersReID framework. VersReID handles multi-scene ReID by capturing different types of knowledge in a model called the ReID Bank (the first stage), guided by scene labels, and then unifying that knowledge into the V-Branch model (the second stage) to remove the dependence on scene labels. The top part illustrates the first learning stage, the prompt-based multi-scene joint training. Based on scene labels, we associate distinct prompts with images from different scenes, learning to encode scene-specific knowledge in the prompts. By integrating the scene-specific prompts with a scene-shared backbone that contains scene-invariant knowledge, we build a multi-scene ReID Bank. Although it contains abundant knowledge, the ReID Bank has limited usability because it requires scene labels to select the scene-specific prompts. We solve this problem in the second stage, the scene-specific prompt distillation. As illustrated in the bottom part, we distill the ReID Bank's knowledge into a versatile branch with versatile prompts, called the V-Branch. Through the distillation, the knowledge from different scenes is unified in the V-Branch model. Hence, the V-Branch model can handle multiple ReID scenes simultaneously without using scene labels. "Cat" in the figure denotes the concatenation operation, and $l$ indicates the length of the image tokens.
  • Figure 3: Illustration of the proposed Multi-scene Prioris Data Augmentation (MPDA), which generates multiple augmented views of each source image to simulate cross-scene variations. As a plug-and-play learning strategy, MPDA is applicable to typical self-supervised contrastive learning methods. The geometric data augmentations (random cropping, resizing, and horizontal flipping) are omitted for simplicity.
  • Figure 4: Ablation studies on the number of versatile prompts (left) and the loss weight $\alpha$ (right) in the scene-specific prompt distillation stage for training the V-Branch. We report the mAP on the joint testing set. By default, we use five versatile prompts and set $\alpha$ to 1.0.
  • Figure 5: Visualizations of the self-attention map of the class token $e_{[\mathrm{CLS}]}$ in the last transformer block. Attention maps are averaged over all heads of the self-attention module. Each row shows two images of the same person in a specific scene and their attention maps. Each column shows the result of applying the same scene-specific prompts to different input images. Prompts with a blue background: the attention maps in this column are from the ReID Bank. Attention maps with a yellow background: the scene and the prompts correspond to each other. The red boxes highlight cases where the vanilla ViT-B model attends to the wrong image parts, such as focusing on clothes in the clothing-change scene, focusing on the occluded part in the occlusion scene, and failing to capture shape information in the cross-modality scene.
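
The captions for Figures 2 and 4 mention a distillation stage governed by a loss weight $\alpha$, without stating the loss itself. The sketch below shows one plausible shape for such an objective and should be read as an assumption, not as the paper's formula: it uses a mean-squared-error term between a V-Branch (student) feature and a ReID Bank (teacher) feature, and adds it to a ReID loss with weight `alpha`. The function names, the choice of MSE, and the example numbers are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def vbranch_loss(reid_loss, student_feat, teacher_feat, alpha=1.0):
    """Hypothetical V-Branch objective: a ReID loss plus an alpha-weighted
    distillation term that pulls the student feature toward the teacher.

    ASSUMPTION: the source only says alpha is a loss weight in the
    distillation stage (default 1.0); the MSE form is illustrative.
    """
    return reid_loss + alpha * mse(student_feat, teacher_feat)


# Toy features: the teacher would come from the ReID Bank using the prompts
# selected by the scene label, the student from the V-Branch.
teacher_feat = [0.2, -0.1, 0.4, 0.0]
student_feat = [0.0, -0.1, 0.4, 0.2]

distill_term = mse(student_feat, teacher_feat)
total = vbranch_loss(reid_loss=1.5, student_feat=student_feat,
                     teacher_feat=teacher_feat, alpha=1.0)
no_distill = vbranch_loss(reid_loss=1.5, student_feat=student_feat,
                          teacher_feat=teacher_feat, alpha=0.0)
```

Under this form, setting `alpha` to 0 removes the teacher's influence entirely, while larger values trade ReID supervision for closer agreement with the ReID Bank; an ablation over $\alpha$ explores that kind of trade-off.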