Local-Aware Global Attention Network for Person Re-Identification Based on Body and Hand Images

Nathanael L. Baisa

TL;DR

The paper tackles robust person re-identification under challenging conditions by leveraging both body and hand images. It introduces LAGA-Net, a four-branch architecture that combines channel attention, spatial attention with relative positional encodings, a global branch, and a local stripe-based branch to learn comprehensive feature embeddings. The model is trained end-to-end with a joint loss comprising cross-entropy (with label smoothing) and hard-mining triplet losses. At test time, the six 2048-D embeddings (one from each of the two attention branches and the global branch, plus three from the stripes of the local branch) are concatenated into a 12288-D final descriptor, and descriptors are compared using cosine similarity. Experiments on four body-based datasets and two hand datasets demonstrate state-of-the-art performance, with ablation analyses confirming the contribution of each component; the approach does not rely on external pose cues and is effective in uncontrolled environments.
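The joint objective mentioned above can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the smoothing factor `eps`, the `margin`, the use of Euclidean distance, and the function names `smoothed_cross_entropy` and `batch_hard_triplet` are all assumptions chosen for illustration, and the exact smoothing formula used in the paper may differ.

```python
import math


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target (eps is an assumed value)."""
    num_classes = len(logits)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    loss = 0.0
    for k, logit in enumerate(logits):
        # Assumed smoothing: eps spread uniformly, 1 - eps added on the true class.
        q = eps / num_classes + ((1.0 - eps) if k == target else 0.0)
        loss -= q * (logit - log_z)
    return loss


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet(embeddings, labels, margin=0.3):
    """Hard-mining triplet loss: farthest positive and closest negative per anchor."""
    losses = []
    for i, anchor in enumerate(embeddings):
        pos = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] == labels[i] and j != i]
        neg = [euclidean(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != labels[i]]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0
```

In a multi-branch network, a loss of this form would typically be evaluated on each branch embedding and the results summed; how the paper weights the individual terms is not stated in this summary.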

Abstract

Learning representative, robust and discriminative information from images is essential for effective person re-identification (Re-Id). In this paper, we propose a compound approach for end-to-end discriminative deep feature learning for person Re-Id based on both body and hand images. We carefully design the Local-Aware Global Attention Network (LAGA-Net), a multi-branch deep network architecture consisting of one branch for spatial attention, one branch for channel attention, one branch for global feature representations and another branch for local feature representations. The attention branches focus on the relevant features of the image while suppressing the irrelevant backgrounds. To overcome a weakness of attention mechanisms, namely that they are equivariant to pixel shuffling, we integrate relative positional encodings into the spatial attention module to capture the spatial positions of pixels. The global branch is intended to preserve the global context or structural information. For the local branch, which is intended to capture fine-grained information, we uniformly partition the conv-layer output into horizontal stripes. We retrieve the parts by conducting a soft partition without explicitly partitioning the images or requiring external cues such as pose estimation. A set of ablation studies shows that each component contributes to the increased performance of the LAGA-Net. Extensive evaluations on four popular body-based person Re-Id benchmarks and two publicly available hand datasets demonstrate that our proposed method consistently outperforms existing state-of-the-art methods.
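The uniform horizontal partitioning of the local branch can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the feature map is a nested list indexed `[channel][row][col]`, each stripe is reduced by average pooling (the pooling operator is an assumption), and `stripe_pool` is a hypothetical name. The partition is applied to the feature map rather than to the input image, which is what makes it a soft partition that needs no pose estimation.

```python
def stripe_pool(feature_map, num_stripes=3):
    """Split a [channel][row][col] feature map into equal horizontal stripes
    and average-pool each stripe into one vector of length num_channels."""
    height = len(feature_map[0])
    if height % num_stripes != 0:
        raise ValueError("height must be divisible by num_stripes")
    rows_per_stripe = height // num_stripes

    stripes = []
    for s in range(num_stripes):
        top = s * rows_per_stripe
        vector = []
        for channel in feature_map:
            # Flatten the rows belonging to this stripe and take their mean.
            values = [v for row in channel[top:top + rows_per_stripe] for v in row]
            vector.append(sum(values) / len(values))
        stripes.append(vector)
    return stripes
```

With a ResNet50 layer-4 output (2048 channels) and `num_stripes=3`, this would yield three 2048-D vectors, matching the three local embeddings described for the local branch.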

Paper Structure (18 sections, 11 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Attention modules used in our proposed LAGA-Net: (a) Channel Attention Module (CAM), (b) Spatial Attention Module with Relative Positional Encodings (SAM-RPE).
  • Figure 2: Structure of LAGA-Net. Four separate 3D tensors (one each for the spatial attention branch, the channel attention branch, the global branch and the local branch) are obtained by passing the input image through the stacked convolutional layers of the backbone network. S3 and S4 are the SAM-RPE after layer 3 and layer 4 (L4) of the ResNet50, respectively. Similarly, C3 and C4 are the CAM after layer 3 and layer 4 of the ResNet50, respectively. L4 is also split into three horizontal partitions (stripes) to produce the local branch. Given an input image, six separate 2048-D column feature vectors are obtained by passing it through the backbone network with the 4 branches (the local branch has 3 horizontal stripes). Each classifier predicts the identity (ID) of the input image during training. In the case of hand-based person Re-Id, hand images are used as input.
  • Figure 3: Some qualitative results of our method on the Market-1501 [ZheSheTia15] dataset, showing a query and the ranked results retrieved from the gallery. Left: query image. Right: a) top-5 results of the LAGA-Net, b) top-5 results of the global (without attention) component of the LAGA-Net (baseline). The green and red bounding boxes denote correct and wrong matches, respectively. Feature embeddings from our proposed method (LAGA-Net) give better retrieval performance.
  • Figure 4: Some qualitative results of our method on the 11k [Mah19] and HD [KumXu16] datasets, showing queries and the ranked results retrieved from the gallery. The rows, from top to bottom, show right dorsal of 11k, left dorsal of 11k, right palmar of 11k, left palmar of 11k, and HD. The green and red bounding boxes denote correct and wrong matches, respectively.
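
The retrieval illustrated in Figures 3 and 4 ranks gallery images by their similarity to the query descriptor. A minimal sketch of that step is given here, assuming each image has already been mapped to its branch embeddings; the helper names `concat_descriptor`, `cosine_similarity` and `rank_gallery` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def concat_descriptor(branch_embeddings):
    """Concatenate per-branch embeddings into a single descriptor."""
    return [value for embedding in branch_embeddings for value in embedding]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Six 2048-D embeddings concatenate into one 12288-D descriptor.
descriptor = concat_descriptor([[0.0] * 2048 for _ in range(6)])
assert len(descriptor) == 12288
```

The top-5 lists in the figures correspond to the first five indices returned by a ranking of this kind.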