Cross-video Identity Correlating for Person Re-identification Pre-training

Jialong Zuo; Ying Nie; Hanyu Zhou; Huaxin Zhang; Haoyu Wang; Tianyu Guo; Nong Sang; Changxin Gao

TL;DR

The paper tackles pre-training for person re-identification (ReID) by addressing a gap in existing approaches: they do not learn identity invariance across different videos. It introduces CION, a framework that explicitly seeks identity correlation across videos through progressive multi-level denoising and uses an identity-guided self-distillation loss for large-scale pre-training. In the reported experiments, CION with ResNet50-IBN reaches 93.3% and 74.3% mAP on Market1501 and MSMT17 while using only 8% of the training samples. The method also transfers across model structures, which the authors support with the open-source ReIDZoo: 32 pre-trained models spanning 10 architectures. The work improves the efficiency and generalizability of ReID pre-training and offers a reusable resource for ReID research and applications.

Abstract

Recent research has shown that pre-training on large-scale person images extracted from internet videos is an effective way to learn better representations for person re-identification. However, this work is mostly confined to pre-training at the instance level or the single-video tracklet level. It ignores the identity invariance in images of the same person across different videos, which is a key focus in person re-identification. To address this issue, we propose a Cross-video Identity-cOrrelating pre-traiNing (CION) framework. Defining a notion of noise that considers both intra-identity consistency and inter-identity discrimination, CION seeks the identity correlation in cross-video images by modeling it as a progressive multi-level denoising problem. Furthermore, an identity-guided self-distillation loss is proposed to improve large-scale pre-training by mining the identity invariance within person images. We conduct extensive experiments to verify the superiority of CION in terms of efficiency and performance. CION achieves leading performance with fewer training samples. For example, compared with the previous state of the art (ISR), CION with the same ResNet50-IBN achieves higher mAP of 93.3% and 74.3% on Market1501 and MSMT17, while using only 8% of the training samples. Finally, since CION proves to be model-agnostic, we contribute a model zoo named ReIDZoo to meet diverse research and application needs in this field. It contains a series of CION pre-trained models covering a range of structures and parameter counts, totaling 32 models with 10 different structures, including GhostNet, ConvNext, RepViT, FastViT and so on. The code and models will be made publicly available at https://github.com/Zplusdragon/CION_ReIDZoo.

Paper Structure (21 sections, 9 equations, 8 figures, 6 tables)

Introduction
Related work
CION: Cross-video Identity-cOrrelating pre-traiNing
Noise definition
Progressive multi-level denoising
Identity-guided self-distillation
Experiments
Experimental setup
Comparison with state-of-the-art methods
Generalizing to different model structures
Ablation studies and analyses
Conclusion
Appendix
Broader impact
Limitations
...and 6 more sections

Figures (8)

Figure 1: Comparison of our proposed CION with other pre-training methods. In (a), the instance-level method mines instance invariance by contrastive learning on augmented views of each image, completely ignoring the invariance across different images of the same person. In (b), the single-video tracklet-level method mines tracklet invariance by contrastive learning on images of each tracklet in a single video, ignoring the invariance in images of the same person across different videos. In (c), our CION learns identity invariance by correlating the images of the same person across different videos, thus leading to better representation learning.
Figure 2: A toy example for Sliding Range and Linking Relation.
Figure 3: Self-distillation with identity guidance. The overall structure is similar to DINO, but introduces the concept of identity. For simplicity, we illustrate the case of $N_{id}=2$ with pairs of views $(\mathbf{x}_t,\mathbf{x}_s)$. $T_t$ and $T_s$ represent different random transformations. All transformed views from images of the same person engage in contrastive learning.
Figure 4: The t-SNE visualization of extracted image features. Our CION enjoys good identity consistency and discrimination, while the instance-level method LUP and the tracklet-level method LUP-NL do not (marked in red).
Figure 5: Results with different denoising strategies: (1) without denoising, (2) add single-tracklet denoising, (3) further add single-video denoising, (4) further add cross-video denoising.
...and 3 more figures
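
The identity-guided self-distillation described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it only assumes a DINO-style setup in which a teacher distribution (centered and sharpened) supervises a student distribution, and it pairs every teacher view with every student view of the same identity. The function name, the temperature values, the centering vector, and the choice to skip pairing a view with itself are assumptions borrowed from common DINO practice, not details confirmed by this page.

```python
import numpy as np


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def identity_guided_distillation_loss(teacher_logits, student_logits, identity_ids,
                                      center=None, teacher_temp=0.04, student_temp=0.1):
    """Mean cross-entropy between teacher and student outputs of same-identity views.

    teacher_logits, student_logits: arrays of shape (num_views, num_prototypes),
        where row i of both arrays comes from the same transformed view.
    identity_ids: length-num_views sequence giving the person identity of each view.
    center, teacher_temp, student_temp: hypothetical DINO-style hyperparameters.
    """
    teacher_logits = np.asarray(teacher_logits, dtype=float)
    student_logits = np.asarray(student_logits, dtype=float)
    identity_ids = np.asarray(identity_ids)
    if center is None:
        center = np.zeros(teacher_logits.shape[-1])

    # Teacher: centered, sharpened probabilities. Student: log-probabilities.
    p_teacher = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    log_p_student = log_softmax(student_logits, student_temp)

    # cross_entropy[i, j] = H(teacher view i, student view j)
    cross_entropy = -p_teacher @ log_p_student.T

    # Keep only pairs that share an identity; drop a view paired with itself.
    same_identity = identity_ids[:, None] == identity_ids[None, :]
    np.fill_diagonal(same_identity, False)
    if not same_identity.any():
        raise ValueError("each batch needs at least two views of some identity")
    return float(cross_entropy[same_identity].mean())
```

With two identities and two views each, the loss is small when views of the same person produce matching outputs and grows when they disagree, which is the identity invariance the pre-training objective is meant to encourage.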
