Synthetic-To-Real Video Person Re-ID

Xiangqun Zhang; Wei Feng; Ruize Han; Likai Wang; Linqi Song; Junhui Hou

TL;DR

This work tackles cross-domain video-based person Re-ID by leveraging synthetic data to train models that generalize to real-world videos. The authors propose a framework that combines multi-level domain-invariant feature learning with a mean-teacher consistency scheme, augmented by clustering-based ID consistency losses to exploit unlabeled real data. They introduce SVReID and SVReID+ benchmarks to facilitate synthetic-to-real evaluation and demonstrate state-of-the-art performance across five real datasets, with synthetic data sometimes outperforming real data in cross-domain transfers. The study highlights the practical benefits of synthetic video data for scalable Re-ID and provides a benchmark for future research in cross-domain video person Re-ID.

Abstract

Person re-identification (Re-ID) is an important task with significant applications in public security and information forensics, and it has progressed rapidly with the development of deep learning. In this work, we investigate a novel and challenging setting of Re-ID, i.e., cross-domain video-based person Re-ID. Specifically, we use synthetic video datasets as the source domain for training and real-world videos for testing, which notably reduces the reliance on expensive real data acquisition and annotation. To harness the potential of synthetic data, we first propose a self-supervised domain-invariant feature learning strategy for both static and dynamic (temporal) features. Additionally, to enhance person identification accuracy in the target domain, we propose a mean-teacher scheme incorporating a self-supervised ID consistency loss. Experimental results across five real datasets validate the rationale behind cross-synthetic-real domain adaptation and demonstrate the efficacy of our method. Notably, we find that synthetic data outperforms real data in the cross-domain scenario, which is a surprising outcome. The code and data are publicly available at https://github.com/XiangqunZhang/UDA_Video_ReID.

Paper Structure (15 sections, 17 equations, 4 figures, 5 tables)

Introduction
Related Work
Proposed Method
Problem formulation
Multi-level domain invariant feature learning
Consistency learning on unlabeled real data
Implementation details
Experimental Results
Setup
Comparison with state-of-the-art methods
Ablation study
Cross-dataset evaluation results
Enlarged SVReID dataset
Qualitative analysis
Conclusion

Figures (4)

Figure 1: Illustration of different settings for person Re-ID tasks, including (single-domain) image/video person Re-ID and unsupervised domain adaptive (UDA) image person Re-ID in previous works. In this work, we focus on a new task, video-based UDA person Re-ID. Specifically, we are interested in the setting that uses synthetic videos as the source domain, which is more economical and practical.
Figure 2: Framework of the proposed method. The input consists of synthetic videos with ID labels (source domain) and real videos without annotation (target domain). The main network is trained with supervised losses on the source domain data. For video representation learning on the domain-mixed data, we design a series of self-supervised losses $\mathcal{L}^{\rm S}$ that distinguish the domain labels at different levels: frame, video, and combined video. We further apply a mean-teacher strategy to train the network on the unlabeled real data. Taking the main network as the student, we use an exponential moving average (EMA) to obtain a teacher network that "supervises" it through the self-supervised consistency loss $\mathcal{L}^{\rm C}$ between the two.
Figure 3: Examples from SVReID showing the same identity under various view angles, lighting conditions, occlusion/clutter, and clothing changes.
Figure 4: Qualitative analysis of the baseline and the proposed method under complex background (a), occlusion (b), illumination variation (c), and clothing changes (d).
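
The mean-teacher idea described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `ema_update` and `consistency_loss`, the momentum value, and the use of a mean squared error as a stand-in for the ID consistency loss $\mathcal{L}^{\rm C}$ are all assumptions made for illustration, and parameters are plain Python lists rather than network weights.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return teacher parameters moved toward the student by an EMA step.

    `momentum` is a hypothetical value; the paper's setting is not given here.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def consistency_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher features.

    Assumption: MSE stands in for the paper's ID consistency loss.
    """
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n


# Toy parameters standing in for student and teacher network weights.
student = [1.0, 2.0, 3.0]
teacher = [0.0, 0.0, 0.0]

# One EMA step with a large step size so the effect is visible.
teacher = ema_update(teacher, student, momentum=0.5)

# The student would be trained to reduce this gap on unlabeled real data.
loss = consistency_loss(student, teacher)
print(teacher, loss)
```

In this sketch the teacher receives no gradient; it only tracks a smoothed copy of the student, which is what lets it act as a more stable target on the unlabeled target domain.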
