Body Part-Based Representation Learning for Occluded Person Re-Identification

Vladimir Somers; Christophe De Vleeschouwer; Alexandre Alahi

TL;DR

The paper tackles occluded person re-identification by shifting from global to body-part-level representations. It introduces BPBreID, a two-branch model with a learnable body part attention module and a global-local representation learning module, trained with the GiLt scheme, which combines identity supervision on holistic embeddings with a triplet objective on part-based embeddings. Two key contributions are the part-averaged triplet loss, which stabilizes learning under occlusion by averaging distances across parts, and a visibility-based part-to-part matching strategy used at inference. Empirically, BPBreID outperforms prior state-of-the-art methods by 0.7% mAP and 5.6% rank-1 accuracy on Occluded-Duke and also performs well on holistic ReID benchmarks. The code is publicly released.
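As a rough illustration of the part-averaged triplet loss mentioned above, the sketch below averages per-part Euclidean distances between two images and feeds that single distance into a triplet loss. The batch-hard mining scheme, the margin value, and the function names are assumptions made for this example; it is not the authors' released implementation.

```python
import numpy as np

def pairwise_part_distances(parts):
    """parts: (N, K, D) array holding K part embeddings for each of N images.
    Returns a (K, N, N) array of Euclidean distances, one matrix per part."""
    p = np.transpose(parts, (1, 0, 2))            # (K, N, D)
    diff = p[:, :, None, :] - p[:, None, :, :]    # (K, N, N, D)
    return np.sqrt((diff ** 2).sum(axis=-1))

def part_averaged_triplet_loss(parts, labels, margin=0.3):
    """Triplet loss on the distance averaged over parts.
    Batch-hard mining and margin=0.3 are assumptions for this sketch."""
    dist = pairwise_part_distances(parts).mean(axis=0)   # (N, N), mean over K parts
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)   # same identity, not self
    neg_mask = ~same
    # Hardest positive: farthest same-identity sample for each anchor.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: closest different-identity sample for each anchor.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    losses = np.maximum(hardest_pos - hardest_neg + margin, 0.0)
    return float(losses[valid].mean()) if valid.any() else 0.0
```

Because the margin is enforced on the averaged distance rather than on each part separately, a single part that looks alike across two identities does not have to be pushed apart on its own.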

Abstract

Occluded person re-identification (ReID) is a person retrieval task that aims to match occluded person images with holistic ones. For addressing occluded ReID, part-based methods have been shown to be beneficial, as they offer fine-grained information and are well suited to representing partially visible human bodies. However, training a part-based model is challenging for two reasons. First, individual body part appearance is not as discriminative as global appearance (two distinct IDs might have the same local appearance), which means that standard ReID training objectives using identity labels are not adapted to local feature learning. Second, ReID datasets are not provided with human topographical annotations. In this work, we propose BPBreID, a body part-based ReID model that addresses these issues. We first design two modules for predicting body part attention maps and producing body part-based features of the ReID target. We then propose GiLt, a novel training scheme for learning part-based representations that is robust to occlusions and non-discriminative local appearance. Extensive experiments on popular holistic and occluded datasets show the effectiveness of our proposed method, which outperforms state-of-the-art methods by 0.7% mAP and 5.6% rank-1 accuracy on the challenging Occluded-Duke dataset. Our code is available at https://github.com/VlSomers/bpbreid.

Paper Structure (25 sections, 10 equations, 5 figures, 4 tables)

Introduction
Related Work
Part-based feature alignment in ReID:
Local feature learning in ReID:
Methodology
Body Part Attention Module
Pixel-wise Part Classifier
Human Parsing Labels
Body Part Attention Loss
Global-local Representation Learning Module
Holistic and Body Part-based Features
Body Part Visibility Estimation
Overall Training Procedure
GiLt Loss
Part-Averaged Triplet Loss
...and 10 more sections

Figures (5)

Figure 1: Overview of key concepts in our work. The first row illustrates the four challenges of occluded and part-based ReID that our proposed method addresses. The second row illustrates our pre-generated human parsing labels and the ReID-relevant soft attention maps produced by our model BPBreID.
Figure 2: Structure of BPBreID, with the detailed architecture and training procedure in the top part and the inference procedure in the bottom part. The model consists of a body part attention module that predicts body part attention maps and a global-local representation learning module that produces holistic features {$f_{\text{g}}$, $f_{\text{f}}$, $f_{\text{c}}$} and body part-based features {$f_1$, ..., $f_K$} together with their visibility scores {$v_{\text{f}}$, $v_1$, ..., $v_K$}. For holistic features, "$\text{g}$" stands for "global", "$\text{f}$" for "foreground" and "$\text{c}$" for "concatenated". GWAP stands for global weighted average pooling. The network is trained end-to-end using a body part attention loss for supervising part prediction, a standard identity loss on holistic features and a part-averaged triplet loss on body part-based features. At inference, the query-to-gallery distance is computed using a part-to-part matching strategy that compares only mutually visible body parts. Green/red colors depict visible/invisible body parts. Each component of the architecture is framed with a grey rectangle, with its name and a number referencing the section describing it. For conciseness, BPBreID is represented here with $K=4$: {head, torso, legs, feet}.
Figure 3: Visualization of ranking results based on individual body part-based embeddings (top row of each query) or all body part-based embeddings with the foreground embedding (bottom row of each query). For the "all parts" rows, only the foreground attention map is displayed for conciseness. In the top row of each query, the retrieved gallery samples are very similar w.r.t. the compared body part, but identities do not match because a single body part is not discriminative enough. Green/red borders are correct/incorrect matches. Best viewed in color and zoomed in.
Figure 4: We compare the ranking performance of our model BPBreID with other methods: PAT, a part-based transformer method with part discovery, and our baseline, the global method BoT. As illustrated in this figure, BoT cannot handle occlusions and PAT is inferior at detecting and aligning fine-grained local appearance features.
Figure 5: We compare the attention maps produced by our model BPBreID (on test images unseen during training) with the attention maps from other state-of-the-art part-based methods: ISP and PAT. "Fg" refers to the foreground attention map, which is obtained by fusing the maps from all parts together. Green/red borders indicate visible/invisible parts; no color is displayed for PAT because this method is not designed with a visibility score mechanism. Both ISP and PAT use part discovery to define the human semantic regions, which can lead to missed parts, background clutter or feature misalignment. As illustrated in this figure, our attention maps do not suffer from these issues. However, unlike these methods, our method only detects body parts and no belongings, such as bags or umbrellas. Moreover, most part-based methods (such as PAT, ISP, HOReID, ...) try to make each part-based embedding discriminative on its own. This is done either by incorporating global information into each local embedding (HOReID), by having each part attend to multiple regions of the target person's body (PAT), or by mining discriminative local features (ISP), as illustrated in this figure. In contrast, we learn part-based embeddings that represent their associated body part well, without requiring them to be discriminative on their own, but requiring them to be discriminative when used as a whole. The PifPaf row illustrates the coarse PifPaf part confidence and affinity fields described in the first section of these supplementary materials (tensor $E$ for $K=5$), from which we derive the human parsing labels used at training.
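Two operations named in the Figure 2 caption, global weighted average pooling (GWAP) and part-to-part matching over mutually visible parts, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `eps` stabilizer, the binary visibility scores, and the choice to return infinity when no part is visible in both images are hypothetical and not taken from the authors' code.

```python
import numpy as np

def gwap(feature_map, attention, eps=1e-6):
    """Global weighted average pooling.
    feature_map: (C, H, W) spatial features.
    attention:   (K, H, W) non-negative attention maps, one per body part.
    Returns (K, C): one attention-weighted average embedding per part."""
    weighted = np.einsum("khw,chw->kc", attention, feature_map)
    norm = attention.sum(axis=(1, 2))[:, None] + eps   # eps avoids division by zero
    return weighted / norm

def visible_part_distance(q_parts, q_vis, g_parts, g_vis):
    """Distance between a query and a gallery sample using shared visible parts.
    q_parts, g_parts: (K, D) part embeddings.
    q_vis, g_vis:     (K,) visibility scores (assumed here: 1 = visible, 0 = not)."""
    per_part = np.linalg.norm(q_parts - g_parts, axis=1)   # (K,) Euclidean distances
    shared = q_vis * g_vis                                 # non-zero only if visible in both
    if shared.sum() == 0:
        return float("inf")   # assumption: nothing comparable, treat as maximally distant
    return float((shared * per_part).sum() / shared.sum())
```

For example, with a query whose legs are occluded, the legs term receives zero weight, so the distance depends only on the parts visible in both images.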
