
Unleashing the Potential of Tracklets for Unsupervised Video Person Re-Identification

Nanxing Meng, Qizao Wang, Bin Li, Xiangyang Xue

TL;DR

This work tackles unsupervised video person re-identification by exploiting the rich temporal information in tracklets without relying on identity or camera labels. The core idea is to refine tracklet representations with the Noise-Filtered Tracklet Partition (NFTP) module, which splits each tracklet into sub-tracklets, and then to apply Self-Supervised Refined Clustering (SSR-C) with a progressive sub-cluster merging strategy to generate reliable pseudo labels. Learning is guided by a Class Smoothing Classification (CSC) loss that leverages multiple related sub-clusters per refined label. Evaluated on MARS and DukeMTMC-VideoReID, the approach sets a new state of the art among unsupervised methods and is comparable to advanced supervised approaches, while maintaining inference efficiency. Overall, SSR-C demonstrates that carefully exploiting intra-tracklet consistency and locality via self-supervised clustering can overcome noisy tracklets and yield robust cross-camera identity representations in a fully unsupervised setting.

Abstract

With rich temporal-spatial information, video-based person re-identification methods have shown broad prospects. Although tracklets can be easily obtained with ready-made tracking models, annotating identities is still expensive and impractical. Therefore, some video-based methods propose using only a few identity annotations or camera labels to facilitate feature learning. They also simply average the frame features of each tracklet, overlooking unexpected variations and inherent identity consistency within tracklets. In this paper, we propose the Self-Supervised Refined Clustering (SSR-C) framework without relying on any annotation or auxiliary information to promote unsupervised video person re-identification. Specifically, we first propose the Noise-Filtered Tracklet Partition (NFTP) module to reduce the feature bias of tracklets caused by noisy tracking results, and sequentially partition the noise-filtered tracklets into "sub-tracklets". Then, we cluster and further merge sub-tracklets using the self-supervised signal from the tracklet partition, which is enhanced through a progressive strategy to generate reliable pseudo labels, facilitating intra-class cross-tracklet aggregation. Moreover, we propose the Class Smoothing Classification (CSC) loss to efficiently promote model learning. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate that our proposed SSR-C for unsupervised video person re-identification achieves state-of-the-art results and is comparable to advanced supervised methods. The code is available at https://github.com/Darylmeng/SSRC-Reid.
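
A minimal sketch of the filter-then-partition idea behind NFTP, written in Python with NumPy. The abstract does not specify the filtering rule, so the cosine-similarity test against the tracklet mean and the names `filter_and_partition`, `sim_threshold`, and `num_parts` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def filter_and_partition(frame_feats, sim_threshold=0.5, num_parts=2):
    """Drop likely-noisy frames, then split the rest into sequential chunks.

    frame_feats: (T, D) per-frame features of one tracklet, in temporal order.
    Returns (list of sub-tracklet mean features, boolean keep mask).
    The similarity-to-mean rule is an assumed stand-in for the paper's filter.
    """
    feats = np.asarray(frame_feats, dtype=float)
    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.clip(norms, 1e-12, None)

    # Tracklet-level reference: the normalised mean of all frame features.
    center = feats.mean(axis=0)
    center = center / max(np.linalg.norm(center), 1e-12)

    # Keep frames that agree with the reference (assumed filtering rule).
    keep = feats @ center >= sim_threshold
    if not keep.any():
        keep[:] = True  # never discard an entire tracklet
    kept = feats[keep]

    # Sequential partition: contiguous, near-equal chunks in temporal order.
    chunks = np.array_split(kept, min(num_parts, len(kept)))
    return [chunk.mean(axis=0) for chunk in chunks], keep
```

Because the chunks are contiguous in time and come from one tracklet, the resulting sub-tracklets can be assumed to share an identity, which is the self-supervised signal the abstract refers to.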

Paper Structure

This paper contains 44 sections, 10 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Limitations of directly applying clustering on tracklets to obtain pseudo labels. Different shapes denote different identities, and the same shapes with various colors represent different tracklets of the same pedestrian. Due to noise and unexpected variations within tracklets (marked with yellow ovals), (a) different individuals are wrongly clustered together, and (b) tracklets of the same person are incorrectly pushed apart.
  • Figure 2: The framework of our proposed SSR-C. During each training epoch, our proposed Noise-Filtered Tracklet Partition (NFTP) module filters noisy frames within each tracklet and then sequentially partitions each tracklet into multiple sub-tracklets. Different shapes (e.g., triangle and circle) correspond to different tracklets, different colors (e.g., red and blue) denote different identities, and various transparencies of the same color represent the sub-tracklets of each tracklet. With noise-filtered sub-tracklets, we acquire reliable sub-clusters via strictly restricted clustering, and the sub-clusters are then merged by leveraging the self-supervised signal from tracklet partitioning and following a progressive merging strategy. Finally, we design the Class Smoothing Classification (CSC) loss, which leverages the obtained accurate pseudo labels to effectively improve the discriminative ability of the model.
  • Figure 3: Illustration of our proposed progressive sub-cluster merging. Different shapes (e.g., triangle and circle) correspond to different tracklets, different colors (e.g., red and blue) denote different identities, and various transparencies of the same color represent the sub-tracklets of each tracklet. Based on the generated reliable sub-clusters, only directly reachable sub-clusters are merged during the early stages of training, while all reachable sub-clusters are merged later.
  • Figure 4: Influence of different values of (a) $l$, (b) $K$, (c) $\lambda$. The blue and red dashed lines in (b) denote the Rank-1 and mAP accuracies of our proposed method, respectively. The ablation experiments are performed on the Duke-V dataset and the optimal hyper-parameter values are directly applied to MARS.
  • Figure 5: (a) Influence of different values of $\delta$ on the performance, and (b) the average number of filtered frames at different epochs with various values of $\delta$. The blue and red dashed lines in (a) denote the Rank-1 and mAP accuracies of the variant without NF, respectively. The ablation is performed on Duke-V, and the optimal value is directly applied to MARS without further tuning.
  • ...and 4 more figures
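
The progressive merging described in the Figure 3 caption can be sketched as a graph problem in Python. This is only an illustration under stated assumptions: two sub-clusters are treated as "directly reachable" when they contain sub-tracklets from the same tracklet, and as "reachable" when a chain of such links connects them. The paper's own definitions are not reproduced here, and the names `build_links` and `related_sub_clusters` are hypothetical.

```python
from collections import defaultdict


def build_links(sub_cluster_of, tracklet_of):
    """Link sub-clusters that share a source tracklet (assumed criterion).

    sub_cluster_of[i]: sub-cluster id of sub-tracklet i.
    tracklet_of[i]:    id of the tracklet that sub-tracklet i was cut from.
    """
    clusters_by_tracklet = defaultdict(set)
    for cluster, tracklet in zip(sub_cluster_of, tracklet_of):
        clusters_by_tracklet[tracklet].add(cluster)

    links = defaultdict(set)
    for clusters in clusters_by_tracklet.values():
        for cluster in clusters:
            links[cluster] |= clusters - {cluster}
    return dict(links)


def related_sub_clusters(links, transitive):
    """Map each sub-cluster to the set of sub-clusters it is merged with.

    transitive=False: only direct neighbours (early training stages).
    transitive=True:  the whole connected component (later stages).
    """
    if not transitive:
        return {c: {c} | nbrs for c, nbrs in links.items()}

    related, seen = {}, set()
    for start in links:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            cluster = stack.pop()
            if cluster in component:
                continue
            component.add(cluster)
            stack.extend(links[cluster] - component)
        seen |= component
        for cluster in component:
            related[cluster] = component
    return related
```

In this reading, the early-stage groups may overlap (a sub-cluster can be a neighbour of several others), while the later-stage groups are disjoint components that can serve directly as refined pseudo labels.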

Theorems & Definitions (2)

  • Definition 1
  • Definition 2