Unleashing the Potential of Tracklets for Unsupervised Video Person Re-Identification
Nanxing Meng, Qizao Wang, Bin Li, Xiangyang Xue
TL;DR
This work tackles unsupervised video person re-identification by exploiting the rich temporal information in tracklets without relying on identity or camera labels. The Self-Supervised Refined Clustering (SSR-C) framework first applies Noise-Filtered Tracklet Partition (NFTP), which filters out noisy tracking results and splits each tracklet into sub-tracklets, and then clusters the sub-tracklets and progressively merges the resulting sub-clusters to generate reliable pseudo labels. Learning is guided by a Class Smoothing Classification (CSC) loss that leverages multiple related sub-clusters per refined label. Evaluated on MARS and DukeMTMC-VideoReID, the approach achieves state-of-the-art performance among unsupervised methods and is comparable to advanced supervised methods, while maintaining inference efficiency. Overall, SSR-C demonstrates that carefully exploiting intra-tracklet consistency and locality via self-supervised clustering can effectively overcome noisy tracklets and yield robust cross-camera identity representations in a fully unsupervised setting.
Abstract
With rich temporal-spatial information, video-based person re-identification methods have shown broad prospects. Although tracklets can be easily obtained with ready-made tracking models, annotating identities is still expensive and impractical. Therefore, some video-based methods propose using only a few identity annotations or camera labels to facilitate feature learning. They also simply average the frame features of each tracklet, overlooking unexpected variations and inherent identity consistency within tracklets. In this paper, we propose the Self-Supervised Refined Clustering (SSR-C) framework without relying on any annotation or auxiliary information to promote unsupervised video person re-identification. Specifically, we first propose the Noise-Filtered Tracklet Partition (NFTP) module to reduce the feature bias of tracklets caused by noisy tracking results, and sequentially partition the noise-filtered tracklets into "sub-tracklets". Then, we cluster and further merge sub-tracklets using the self-supervised signal from the tracklet partition, which is enhanced through a progressive strategy to generate reliable pseudo labels, facilitating intra-class cross-tracklet aggregation. Moreover, we propose the Class Smoothing Classification (CSC) loss to efficiently promote model learning. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate that our proposed SSR-C for unsupervised video person re-identification achieves state-of-the-art results and is comparable to advanced supervised methods. The code is available at https://github.com/Darylmeng/SSRC-Reid.
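To make the pipeline described above more concrete, the following is a minimal sketch of the two ideas the abstract names: filtering and sequentially partitioning a tracklet into sub-tracklets, and merging clusters using the fact that sub-tracklets cut from the same tracklet share an identity. It is not the authors' implementation. The abstract does not state the filtering criterion or the merging rule, so the cosine-similarity threshold `min_sim`, the number of parts `num_parts`, and the union-find merge are all assumptions introduced here for illustration, as are the function names.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def filter_and_partition(frames, num_parts=2, min_sim=0.5):
    """Drop outlier frames, then split the rest into sequential sub-tracklets.

    frames: per-frame feature vectors in temporal order.
    Assumption (not from the paper): a frame is treated as noise when its
    cosine similarity to the tracklet mean falls below min_sim.
    Returns one averaged feature vector per sub-tracklet.
    """
    center = mean_vector(frames)
    kept = [f for f in frames if cosine(f, center) >= min_sim]
    if not kept:  # fall back to the raw tracklet if everything was filtered
        kept = frames
    num_parts = min(num_parts, len(kept))
    size, extra = divmod(len(kept), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(mean_vector(kept[start:end]))
        start = end
    return parts


def merge_clusters(cluster_of, tracklet_of):
    """Merge clusters that contain sub-tracklets of the same source tracklet.

    cluster_of[i] is the initial cluster id of sub-tracklet i and
    tracklet_of[i] is the tracklet it was cut from. Assumption: any shared
    tracklet triggers a merge (a simple union-find, with no progressive
    schedule or reliability check).
    Returns a refined label per sub-tracklet.
    """
    parent = {c: c for c in set(cluster_of)}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    first_cluster = {}
    for c, t in zip(cluster_of, tracklet_of):
        if t in first_cluster:
            ra, rb = find(first_cluster[t]), find(c)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
        else:
            first_cluster[t] = c
    return [find(c) for c in cluster_of]


# Toy tracklet: the third frame points the opposite way and is filtered out.
frames = [[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [0.9, 0.0], [1.0, 0.0]]
parts = filter_and_partition(frames, num_parts=2, min_sim=0.5)

# Sub-tracklets of tracklets "a" and "b" straddle clusters 0, 1, and 2.
refined = merge_clusters([0, 1, 1, 2, 3], ["a", "a", "b", "b", "c"])
```

In this toy run, clusters 0, 1, and 2 collapse into one refined label because tracklets "a" and "b" link them, while cluster 3 stays separate. An unconditional transitive merge like this can over-merge when the initial clusters are impure, which is presumably one reason the paper describes its merging strategy as progressive.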
