Enhancing Long-Term Person Re-Identification Using Global, Local Body Part, and Head Streams

Duy Tran Thanh; Yeejin Lee; Byeongkeun Kang

TL;DR

This paper tackles long-term person re-identification, covering both clothes-changing and clothes-consistent scenarios. It introduces the Parts-Aligned and Head Network (PAH-Net), a three-stream architecture consisting of global, local body-part, and head streams. Each stream encodes distinct identity cues, and the network is trained with a combination of the identity classification loss $\mathcal{L}_{id}$, the pair-based loss $\mathcal{L}_{pair}$, and the pseudo body part segmentation loss $\mathcal{L}_{psd}$. A body-part segmentation head trained on pseudo-labels enables implicit part alignment without external annotations, while an explicit head stream leverages facial/head information. On Celeb-reID, PRCC, and VC-Clothes, PAH-Net outperforms the previous state-of-the-art method, with ablations confirming the complementary value of each stream and the effectiveness of adversarial erasing. The work targets practical long-term re-identification for surveillance and autonomous service robots by combining global and localized cues.
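
The combination of the three losses is described in the abstract as a weighted summation. Written out, it would take a form like the following, where the scalar weights $\lambda_{id}$, $\lambda_{pair}$, and $\lambda_{psd}$ are symbols introduced here for illustration only; this section does not state their names or values.

$$
\mathcal{L} = \lambda_{id}\,\mathcal{L}_{id} + \lambda_{pair}\,\mathcal{L}_{pair} + \lambda_{psd}\,\mathcal{L}_{psd}
$$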

Abstract

This work addresses the task of long-term person re-identification. Typically, person re-identification assumes that people do not change their clothes, which limits its applications to short-term scenarios. To overcome this limitation, we investigate long-term person re-identification, which considers both clothes-changing and clothes-consistent scenarios. In this paper, we propose a novel framework that effectively learns and utilizes both global and local information. The proposed framework consists of three streams: global, local body part, and head streams. The global and head streams encode identity-relevant information from an entire image and a cropped image of the head region, respectively. Both streams encode the most distinct, less distinct, and average features using the combinations of adversarial erasing, max pooling, and average pooling. The local body part stream extracts identity-related information for each body part, allowing it to be compared with the same body part from another image. Since body part annotations are not available in re-identification datasets, pseudo-labels are generated using clustering. These labels are then utilized to train a body part segmentation head in the local body part stream. The proposed framework is trained by backpropagating the weighted summation of the identity classification loss, the pair-based loss, and the pseudo body part segmentation loss. To demonstrate the effectiveness of the proposed method, we conducted experiments on three publicly available datasets (Celeb-reID, PRCC, and VC-Clothes). The experimental results demonstrate that the proposed method outperforms the previous state-of-the-art method.
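
The pseudo-label step can be illustrated with a minimal sketch. The abstract only says that pseudo-labels come from clustering; the choice of k-means, the per-pixel feature vectors used as input, and the function name `kmeans_pseudo_labels` are all assumptions made for this example, not the authors' implementation.

```python
import random

def kmeans_pseudo_labels(features, num_parts, num_iters=20, seed=0):
    """Cluster per-pixel feature vectors into `num_parts` pseudo body-part labels.

    features: list of equal-length lists of floats (one vector per pixel).
    Returns a list of integer cluster labels, one per pixel.
    k-means is an assumed stand-in for the unspecified clustering method.
    """
    rng = random.Random(seed)
    # Initialise centroids from distinct, randomly chosen pixels.
    centroids = [list(v) for v in rng.sample(features, num_parts)]
    labels = [0] * len(features)
    for _ in range(num_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(features):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: mean of assigned vectors; keep the old centroid if empty.
        for k in range(num_parts):
            members = [features[i] for i, lab in enumerate(labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In a pipeline of this kind, the resulting labels would be reshaped back to the spatial grid and used as targets for the segmentation head, so that the head learns to predict the same part assignment without any manual body part annotation.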

Paper Structure (19 sections, 16 equations, 9 figures, 7 tables)

INTRODUCTION
RELATED WORKS
Long-Term Person Re-Identification
Architectures with Multiple Streams in Other Applications
PROPOSED METHOD
Network Architecture
Global Stream
Local Body Part Stream
Head Stream
Training
Pseudo-Label Generation for Training
Loss Function
Inference
EXPERIMENTS AND RESULTS
Dataset
...and 4 more sections

Figures (9)

Figure 1: Illustration of the proposed framework. The proposed Parts-Aligned and Head Network (PAH-Net) consists of three streams: the global, local body part, and head streams. The network is trained by backpropagating the combination of three losses: identity classification loss, pair-based loss, and body part segmentation loss. During training, pseudo pixel-level body part labels are generated and utilized along with training images and their image-level identity labels.
Figure 2: The proposed framework during training. GMP, GAP, and AE denote global max pooling, global average pooling, and adversarial erasing, respectively. BN+FC represents batch normalization followed by a fully connected layer. Aggregation and CONCAT denote feature aggregation over each body part and concatenation, respectively. $\odot$ denotes an element-wise multiplication for each channel. $\mathcal{L}_{id}$, $\mathcal{L}_{pair}$, and $\mathcal{L}_{psd}$ represent the identity classification loss, the pair-based loss, and the pseudo body part segmentation loss, respectively.
Figure 3: Illustration of the networks $f_{g_1}(\cdot)$, $f_{g_{21}}(\cdot)$, and $f_{g_{22}}(\cdot)$ based on the OSNet backbone [zhou2019omni] given an input image $\boldsymbol{I}$. $f_{bl}(\cdot)$ represents a block in OSNet [zhou2019omni]. $\hat{f}_{bl}(\cdot)$ and $\bar{f}_{bl}(\cdot)$ denote the first layer and the remaining layers of a block, respectively. (a) Structure of the networks $f_{g_1}(\cdot)$, $f_{g_{21}}(\cdot)$, and $f_{g_{22}}(\cdot)$; (b) Structure of a block $f_{bl}(\cdot)$.
Figure 4: Illustration of adversarial erasing in the global stream. (a) Input image; (b) Sum of squared activation values in Eq. (3) along the channel axis; (c) Adversarially erased input image.
Figure 5: Illustration of the network $f_{l}(\cdot)$ based on the HRNet-W32 backbone [sun2019deep] to obtain a dense feature representation $\boldsymbol{F}^{dense}$ given an input image $\boldsymbol{I}$. $f_{l}^{s}(\cdot)$, $f_{l}^{1}(\cdot)$, and $f_{l}^{f}(\cdot)$ represent the stem layers, the first stage, and the final layers, respectively. $f_{l}^{mn}(\cdot)$ denotes the $n$-th parallel branch in the $m$-th stage. $f_{l}^{mn}(\cdot)$ processes the feature maps at a reduced resolution by $1/2^{(n-1)}$ for $n > 1$. The blue dotted line represents bilinear upsampling and a convolution layer. The black solid and dashed lines denote convolution layers with strides of 1 and 2, respectively.
...and 4 more figures
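
The operations named in the Figure 2 and Figure 4 captions (sum of squared activations along the channel axis, adversarial erasing, GMP, and GAP) can be sketched on plain nested lists. This is a rough illustration under stated assumptions: the erasing rule (zeroing the top `erase_ratio` fraction of locations), the ratio itself, the single-channel image, and the matching spatial size of image and feature map are all hypothetical and are not taken from the paper.

```python
def activation_map(feat):
    """Sum of squared activations along the channel axis.

    feat: nested list indexed as feat[c][h][w]. Returns a map indexed [h][w].
    """
    height, width = len(feat[0]), len(feat[0][0])
    return [[sum(ch[y][x] ** 2 for ch in feat) for x in range(width)]
            for y in range(height)]

def erase_most_active(image, act, erase_ratio=0.25):
    """Zero the pixels whose activation lies in the top `erase_ratio` fraction.

    Assumes `image` ([h][w]) and `act` share the same spatial size. The
    thresholding rule and the ratio are illustrative assumptions.
    """
    flat = sorted((v for row in act for v in row), reverse=True)
    num_erase = max(1, int(len(flat) * erase_ratio))
    threshold = flat[num_erase - 1]
    return [[0.0 if act[y][x] >= threshold else image[y][x]
             for x in range(len(image[0]))] for y in range(len(image))]

def global_max_pool(feat):
    """GMP: one value per channel, the maximum over all spatial locations."""
    return [max(v for row in ch for v in row) for ch in feat]

def global_avg_pool(feat):
    """GAP: one value per channel, the mean over all spatial locations."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
            for ch in feat]
```

Erasing the most strongly activated region pushes a second pass over the image toward less dominant cues, which matches the abstract's distinction between the most distinct and less distinct features.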
