PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification

Bin Hu, Xinggang Wang, Wenyu Liu

TL;DR

PersonViT introduces a large-scale, self-supervised ViT for person ReID by integrating Masked Image Modeling with DINO-style contrastive learning. Pretraining on the unlabeled LUPerson dataset enables the model to learn rich local and global representations, which are then fine-tuned on four standard ReID benchmarks to achieve state-of-the-art results, notably under occlusion. The method demonstrates strong generalization and interpretable local-feature discovery, with visualization analyses corroborating automatic localization of body parts and robust attention to human contours. The work also discusses optimization trade-offs and practical considerations, offering a path toward scalable ReID with reduced labeling requirements and sharing code and pretrained models for broader adoption.

Abstract

Person Re-Identification (ReID) aims to retrieve images of the same individual across non-overlapping cameras and has a wide range of applications in public safety. In recent years, advances in the Vision Transformer (ViT) and self-supervised learning have greatly improved the performance of person ReID based on self-supervised pre-training. Person ReID requires highly discriminative, fine-grained local features of the human body, whereas a conventional ViT is better at extracting context-related global features and has difficulty focusing on local body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID. It combines masked image modeling with discriminative contrastive learning in large-scale unsupervised pre-training to extract high-quality global and local features, and then performs supervised fine-tuning on the person ReID task. This ViT-based person feature extraction method with masked image modeling (PersonViT) requires no labels for pre-training, is scalable, and generalizes well, overcoming the difficulty of annotation in supervised person ReID. It achieves state-of-the-art results on publicly available benchmark datasets, including MSMT17, Market1501, DukeMTMC-reID, and Occluded-Duke. The code and pre-trained models of PersonViT are released at https://github.com/hustvl/PersonViT to promote further research in person ReID.
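To make the idea of combining a global, DINO-style objective with a masked-image-modeling objective more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the MIM term is written as token-level self-distillation between a teacher and a student on masked patches, which this summary does not confirm. The function names, the temperatures, and the `mim_weight` parameter are all hypothetical, and teacher centering and the momentum update used in DINO-style training are omitted for brevity.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distill_loss(student_logits, teacher_logits, student_temp, teacher_temp):
    """Cross-entropy between the teacher distribution and the student distribution."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def combined_loss(student_cls, teacher_cls, student_patches, teacher_patches, mask,
                  student_temp=0.1, teacher_temp=0.04, mim_weight=1.0):
    """Global class-token term plus a local term averaged over masked patches.

    All default values are illustrative assumptions, not values from the paper.
    """
    # Global term: the student matches the teacher on the class-token output.
    global_term = distill_loss(student_cls, teacher_cls, student_temp, teacher_temp)

    # Local term: only positions that were masked in the student's input count.
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if masked:
        local_term = sum(
            distill_loss(student_patches[i], teacher_patches[i], student_temp, teacher_temp)
            for i in masked
        ) / len(masked)
    else:
        local_term = 0.0

    return global_term + mim_weight * local_term
```

In this sketch the global term pushes the class-token output toward an image-level representation, while the local term supervises individual patch tokens, which is one plausible way to encourage the fine-grained local features the abstract calls for.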

Paper Structure (27 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Person ReID performance on both MSMT17 and Market1501. The proposed PersonViT method obtains SOTA results and significantly outperforms previous methods.
  • Figure 2: Overview of PersonViT framework.
  • Figure 3: Supervised person ReID accuracy at different numbers of pre-training epochs.
  • Figure 4: Visualization of the pattern layout of patch-token clusters.
  • Figure 5: Visualization of self-attention maps on images with complex backgrounds.
  • ...and 1 more figure