Dynamic Patch-aware Enrichment Transformer for Occluded Person Re-Identification

Xin Zhang, Keren Fu, Qijun Zhao

TL;DR

Dynamic Patch-aware Enrichment Transformer (DPEFormer) addresses occluded person re-ID by learning patch-level body cues without external detectors, using DPSM to select informative patch tokens, FBM to fuse global and local features, and ROA to generate realistic occlusions for robust training. The approach leverages a memory-bank contrastive loss alongside identity supervision to learn discriminative representations, achieving state-of-the-art results on Occluded-DukeMTMC and strong performance on holistic benchmarks. The combination of detector-free patch selection, enriched feature blending, and SAM-based augmentation enables robust occlusion handling with end-to-end training, offering practical benefits for real-world surveillance where occlusions are common.

Abstract

Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features through the utilization of external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM). As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks signify a substantial advancement of DPEFormer over existing state-of-the-art approaches. The code will be made publicly available.
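
As a rough illustration of the selection step that DPSM performs, the sketch below scores every patch token against a proxy token, sorts the scores in descending order, and keeps the top-ranked tokens, following the description in the Figure 3 caption. The cosine similarity measure, the fixed `keep` count, the single-image (unbatched) setting, and the names `cosine` and `select_tokens` are assumptions made for this example; the paper's actual scoring function and its dynamic selection rule are not given in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def select_tokens(proxy, patch_tokens, keep):
    """Return (indices, scores): indices of the `keep` patch tokens most
    similar to the proxy token, ordered by descending similarity."""
    scores = [cosine(proxy, token) for token in patch_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:keep], scores


# Toy example: a 2-D proxy token and four 2-D patch tokens.
proxy = [1.0, 0.0]
patches = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
selected, scores = select_tokens(proxy, patches, keep=2)
print(selected)
```

In this toy case the two tokens pointing in roughly the same direction as the proxy (indices 2 and 0) are kept, while the orthogonal and opposite tokens are dropped.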

Paper Structure (18 sections, 16 equations, 6 figures, 11 tables, 1 algorithm)

This paper contains 18 sections, 16 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Body information extracted with a pose estimation model [sun2019deep] ((a), (d), (g)) and a human parsing model [li2020self] ((b), (e), (h)). Both models perform well on holistic and object-occluded images but degrade on multi-pedestrian images. By contrast, our DPEFormer selects more accurate patches corresponding to body regions ((c), (f), (i)).
  • Figure 2: Framework of the proposed DPEFormer, which consists of four components: dynamic patch token selection module (DPSM), feature blending modules (FBMs), memory bank with contrastive loss, and realistic occlusion augmentation (ROA). Note that DPSM, FBM and ROA are three distinct contributions of this paper, which characterize the proposed DPEFormer.
  • Figure 3: Illustration of the dynamic selection (DS) process for patch tokens, where "b" denotes the batch size. To perform DS, we compute similarity scores between the proxy token and all patch tokens; the scores are sorted in descending order and used as the basis for selecting essential tokens.
  • Figure 4: Visualization of patches selected by DPSM. Examples to the left of the dashed line show that most selected patches align with pedestrian bodies, whereas examples to the right show failure cases in which the selected patches do not align well with the desired regions.
  • Figure 5: Comparison between the proposed ROA (top) and the existing Cut&Paste (bottom). For ROA, we pre-generate a mask set using SAM [kirillov2023segment], from which we randomly select masks for occlusion synthesis during training. The detailed algorithm is given in Algorithm 1.
  • ...and 1 more figure
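
The mask-based occlusion synthesis described for ROA in the Figure 5 caption (pre-generate a mask set, then randomly pick masks during training) can be sketched roughly as follows. This is a minimal sketch under several assumptions: images are grayscale nested lists, the mask set is a list of hypothetical `(occluder, mask)` pairs assumed to have been produced offline (SAM itself is not called here), and the occluder is pasted at a uniformly random position. The function names `paste_occluder` and `random_occlusion` are illustrative and do not come from the paper.

```python
import random


def paste_occluder(image, occluder, mask, top, left):
    """Return a copy of `image` with occluder pixels written wherever
    `mask` is 1, with the mask's top-left corner placed at (top, left)."""
    out = [row[:] for row in image]  # copy so the input image is untouched
    for r, mask_row in enumerate(mask):
        for c, m in enumerate(mask_row):
            rr, cc = top + r, left + c
            if m and 0 <= rr < len(out) and 0 <= cc < len(out[0]):
                out[rr][cc] = occluder[r][c]
    return out


def random_occlusion(image, mask_set, rng):
    """Pick a random (occluder, mask) pair from the pre-generated set and
    paste it at a random location (assumed placement policy)."""
    occluder, mask = rng.choice(mask_set)
    h, w = len(image), len(image[0])
    mh, mw = len(mask), len(mask[0])
    top = rng.randint(0, max(0, h - mh))
    left = rng.randint(0, max(0, w - mw))
    return paste_occluder(image, occluder, mask, top, left)


# Toy example: a 4x4 blank image and one L-shaped occluder mask.
image = [[0] * 4 for _ in range(4)]
mask_set = [([[7, 7], [7, 7]], [[1, 0], [1, 1]])]
augmented = random_occlusion(image, mask_set, random.Random(0))
for row in augmented:
    print(row)
```

Because only pixels under the mask are overwritten, the pasted occluder keeps an irregular object-like outline rather than a rectangular patch, which is the property the caption contrasts with Cut&Paste.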