
A Review of Recent Techniques for Person Re-Identification

Andrea Asperti, Salvatore Fiorilla, Simone Nardi, Lorenzo Orsini

TL;DR

This survey analyzes the landscape of person re-identification (ReID), contrasting mature supervised approaches with unsupervised methods that are growing quickly but still maturing. It divides supervised ReID into feature learning and metric learning, noting transformer-based architectures and cross-domain adaptation as key trends. For unsupervised ReID, it distinguishes unsupervised domain adaptation from fully unsupervised learning, detailing mean-teacher, memory-based contrastive, SPL, and clustering-based strategies, and discusses challenges in pseudo-label quality and cross-domain generalization. Overall, supervised methods have approached saturation on standard benchmarks, while fully unsupervised approaches are competitive on Market-1501 but still lag on DukeMTMC, pointing to future research in robust pseudo-labeling, domain transfer, and human-centric perception.

Abstract

Person re-identification (ReID), a crucial task in surveillance, involves matching individuals across different camera views. The advent of Deep Learning, especially supervised techniques like Convolutional Neural Networks and Attention Mechanisms, has significantly enhanced person ReID. However, the success of supervised approaches hinges on vast amounts of annotated data, posing scalability challenges in data labelling and computational costs. To address these limitations, recent research has shifted towards unsupervised person re-identification. Leveraging abundant unlabelled data, unsupervised methods aim to overcome the need for pairwise labelled data. Although traditionally trailing behind supervised approaches, unsupervised techniques have shown promising developments in recent years, signalling a narrowing performance gap. Motivated by this evolving landscape, our survey pursues two primary objectives. First, we review and categorize significant publications in supervised person re-identification, providing an in-depth overview of the current state of the art and emphasizing that little room remains for further improvement in this domain. Second, we explore the latest advancements in unsupervised person re-identification over the past three years, offering insights into emerging trends and shedding light on the potential convergence of performance between supervised and unsupervised paradigms. This dual-focus survey aims to contribute to the evolving narrative of person re-identification, capturing both the mature landscape of supervised techniques and the promising outcomes in the realm of unsupervised learning.

Paper Structure

This paper contains 37 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Vanilla model for supervised person ReID. During training, the triplet loss selects an anchor, a positive (same identity), and a negative (different identity) feature vector from the batch. For the classification loss, each identity in the training dataset is treated as a class.
  • Figure 2: Structure of PCB. The feature maps extracted by the backbone network are segmented into six horizontal stripes, and each stripe undergoes horizontal average pooling, producing a single-channel vector for each stripe. Figure taken from sun2018beyond.
  • Figure 3: SSP-ReID framework. One branch is responsible for human semantic parsing, while the other handles saliency detection. Figure taken from quispe2019improved.
  • Figure 4: In the Vision Transformer (ViT) framework, the image is partitioned into a sequence of non-overlapping patches, forming the input for the ViT model. The model processes these patches, assigning each one a corresponding token. Additionally, a global token, denoted by the letter G and highlighted in violet, represents the overall information from the entire image within the sequence of tokens. Figure taken from sharma2021person.
  • Figure 5: In the mean teacher-student framework, both models output a prediction for the same input (corrupted with different noise $\eta$ and $\eta'$). The consistency loss is then calculated to align the two predictions. Finally, the student model is updated via gradient descent, while the teacher model is updated via an exponential moving average (EMA) of the student weights. Figure taken from tarvainen2017mean.
  • ...and 2 more figures
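
The triplet loss mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It uses the standard hinge form, max(0, d(a, p) - d(a, n) + margin). The Euclidean distance, the margin value of 0.3, the toy two-dimensional feature vectors, and the function names are all assumptions made for illustration; they are not taken from the survey.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed value, not from the survey).
    """
    d_ap = euclidean(anchor, positive)  # anchor to same-identity sample
    d_an = euclidean(anchor, negative)  # anchor to different-identity sample
    return max(0.0, d_ap - d_an + margin)


# Toy features (hypothetical): the positive is close, the negative is far.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_easy = triplet_loss(anchor, positive, negative)  # already well separated
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated
```

In the first call the triplet already satisfies the margin, so it contributes no gradient. In the second, the "positive" is farther than the "negative", so the loss is positive and training would pull the two apart.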
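
The teacher update described in the Figure 5 caption can be sketched in a few lines. This toy version treats model weights as flat lists of floats and uses a mean squared error as the consistency loss. The smoothing factor `alpha`, the MSE choice, and the function names are assumptions for illustration rather than details from the survey.

```python
def ema_update(teacher, student, alpha=0.999):
    """Return new teacher weights as an exponential moving average.

    Each teacher weight moves a fraction (1 - alpha) towards the
    corresponding student weight; `alpha` is an assumed hyperparameter.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_pred, teacher_pred):
    """Mean squared error between the two predictions (assumed loss form)."""
    n = len(student_pred)
    return sum((s - t) ** 2 for s, t in zip(student_pred, teacher_pred)) / n


# Toy weights (hypothetical). The student has just taken a gradient step.
teacher_w = [1.0, 0.0]
student_w = [0.0, 1.0]

# The teacher is never updated by gradients, only by averaging.
new_teacher_w = ema_update(teacher_w, student_w, alpha=0.9)

# Consistency between two toy prediction vectors for the same input.
loss = consistency_loss([0.2, 0.8], [0.4, 0.6])
```

With `alpha` close to 1 the teacher changes slowly, which is what makes its predictions a more stable target for the student than the student's own noisy outputs.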