PAFormer: Part Aware Transformer for Person Re-identification

Hyeono Jung, Jangwon Lee, Jiwon Yoo, Dami Ko, Gyeonghwan Kim

TL;DR

PAFormer tackles partial ReID by introducing pose tokens that explicitly associate patch tokens with body parts, enabling precise part-to-part comparisons. It handles occlusion with a learning-based visibility predictor and a teacher-forcing mechanism based on ground-truth visibility, and at inference it requires no extra pose-localization modules. The method optimizes a joint loss comprising CLS ReID, partial ReID, pose supervision, and visibility terms, and computes the distance between samples $i$ and $j$ as $d^{i,j} = d_{CLS}^{i,j} + \frac{\sum_p d_p^{i,j} v_p^{i} v_p^{j}}{\sum_p v_p^{i} v_p^{j}}$, where $d_p^{i,j}$ is the distance for body part $p$ and $v_p^{i}$ is the visibility score of part $p$ in sample $i$. Experiments on Market-1501, DukeMTMC-ReID, and Occluded-Duke show state-of-the-art or competitive performance, with improved robustness to occlusion and better part-level alignment. PAFormer thus advances ReID by integrating anatomical awareness into a transformer framework.
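The distance formula above can be illustrated with a minimal sketch. It assumes Euclidean distance for both the CLS term and the per-part terms (the summary does not name the metric), and the function names `euclidean` and `paformer_distance` are hypothetical. The fallback to the CLS distance when no part is jointly visible is also an assumption, added only to avoid division by zero.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paformer_distance(cls_i, cls_j, parts_i, parts_j, vis_i, vis_j, eps=1e-12):
    """Visibility-weighted distance between samples i and j.

    cls_*   : CLS feature vector of each sample
    parts_* : list of per-part feature vectors (same part order in both)
    vis_*   : list of per-part visibility scores in [0, 1]
    """
    d_cls = euclidean(cls_i, cls_j)

    # A part contributes only to the degree that it is visible in BOTH samples.
    weights = [vi * vj for vi, vj in zip(vis_i, vis_j)]
    total = sum(weights)
    if total <= eps:
        # Assumption: with no jointly visible part, use the CLS distance alone.
        return d_cls

    d_parts = sum(w * euclidean(pi, pj)
                  for w, pi, pj in zip(weights, parts_i, parts_j))
    return d_cls + d_parts / total


# Usage: part 1 is occluded in sample j, so its large distance is ignored.
d = paformer_distance(
    cls_i=[0.0, 0.0], cls_j=[3.0, 4.0],
    parts_i=[[0.0, 0.0], [0.0, 0.0]],
    parts_j=[[1.0, 0.0], [0.0, 10.0]],
    vis_i=[1.0, 1.0], vis_j=[1.0, 0.0],
)
```

Normalizing by the sum of joint visibilities keeps the part term on the same scale whether two samples share many visible parts or only one.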

Abstract

Within the domain of person re-identification (ReID), partial ReID methods are considered mainstream, aiming to measure feature distances by comparing body parts between samples. In practice, however, previous methods often lack sufficient awareness of the anatomical aspects of body parts, and so fail to capture features of the same body parts across different samples. To address this issue, we introduce Part Aware Transformer (PAFormer), a pose-estimation-based ReID model that can perform precise part-to-part comparison. To inject part awareness, we introduce learnable parameters called "pose tokens", which estimate the correlation between each body part and partial regions of the image. Notably, at the inference phase, PAFormer operates without additional modules for body part localization, which are commonly used in previous ReID methods that leverage pose estimation models. Additionally, leveraging this enhanced awareness of body parts, PAFormer uses a learning-based visibility predictor to estimate the degree of occlusion for each body part. We also introduce a teacher forcing technique using ground-truth visibility scores, which enables PAFormer to be trained only with visible parts. Extensive experiments show that our method outperforms existing approaches on well-known ReID benchmark datasets.

Paper Structure (27 sections, 8 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Left) Patch embedding in ViT: CNN filters of the same size as the patch are applied, extracting visual representation features. Right) The generated patch embeddings contain both body-part-related channels and unrelated channels.
  • Figure 2: Each part token should be projected close to the patch tokens corresponding to the body part it represents in the embedding space. If features unrelated to body parts are involved in the similarity computation of ViT, there is a risk of projecting based on specific appearances rather than on body parts.
  • Figure 3: Attention maps of a vanilla ViT-based ReID model: the first sample focuses on the head, the second on the lower body region, and the last on the upper body region. This demonstrates that a partial ReID model trained solely on ReID loss cannot perform part-to-part comparisons effectively in practice.
  • Figure 4: Pipeline of PAFormer. We adopt pose tokens to estimate the association between body parts and patch tokens. Partial features are generated by aggregating patch tokens during the self-attention process, weighted by the predicted probabilities. Additionally, the output pose tokens pass through a visibility predictor to infer visibility scores.
  • Figure 5: Visualization of the similarity between a query token (highlighted with a red border) and other patch tokens when a vanilla ViT is trained with ReID loss. Although the location of the query token is consistent, the patch tokens with high similarity vary with the semantics of the query token. For instance, in the first sample, high similarity is observed with tokens corresponding to the lower body, while in the second sample, tokens associated with the arms show higher similarity.
  • ...and 3 more figures
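
The aggregation step described in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions: association probabilities are taken to be a softmax over patches of the dot product between a pose token and each patch token, and the names `softmax`, `dot`, and `aggregate_part_features` are hypothetical. The paper's actual attention layers, projections, and normalization may differ.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_part_features(pose_tokens, patch_tokens):
    """Build one feature per body part as a probability-weighted sum of patches.

    pose_tokens  : list of P vectors, one per body part
    patch_tokens : list of N vectors, one per image patch
    Assumption: the part-to-patch probability is a softmax over dot products.
    """
    dim = len(patch_tokens[0])
    part_features = []
    for pose in pose_tokens:
        probs = softmax([dot(pose, patch) for patch in patch_tokens])
        feature = [sum(p * patch[k] for p, patch in zip(probs, patch_tokens))
                   for k in range(dim)]
        part_features.append(feature)
    return part_features


# Usage: the first pose token matches patch 0 strongly, the second is neutral.
patches = [[1.0, 0.0], [0.0, 1.0]]
features = aggregate_part_features([[100.0, 0.0], [0.0, 0.0]], patches)
```

In this toy input, the first part feature is almost entirely patch 0, while the neutral pose token yields the plain average of the two patches.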