Attention-based Shape and Gait Representations Learning for Video-based Cloth-Changing Person Re-Identification

Vuong D. Nguyen; Samiha Mirza; Pranav Mantini; Shishir K. Shah

TL;DR

This work tackles video-based cloth-changing person re-identification by learning clothing-invariant cues from 3D pose. It introduces ASGL, combining a shape-learning GAT and a gait-learning ST-GAT to extract robust body geometry and motion features, which are fused with appearance through an Adaptive Fusion Module. Extensive experiments on VCCR and CCVID show that ASGL significantly outperforms state-of-the-art methods, especially under clothing variations, with notable gains in rank-1 accuracy and mAP. The results demonstrate the value of integrating geometry- and motion-based representations with appearance for practical, long-term Re-ID in real-world surveillance scenarios.

Abstract

Current state-of-the-art Video-based Person Re-Identification (Re-ID) primarily relies on appearance features extracted by deep learning models. These methods are not suited to long-term analysis in real-world scenarios where people change clothes, which makes appearance information unreliable. In this work, we address the practical problem of Video-based Cloth-Changing Person Re-ID (VCCRe-ID) by proposing "Attention-based Shape and Gait Representations Learning" (ASGL). Our ASGL framework improves Re-ID performance under clothing variations by learning clothing-invariant gait cues using a Spatial-Temporal Graph Attention Network (ST-GAT). Operating on a 3D-skeleton-based spatial-temporal graph, the proposed ST-GAT comprises multi-head attention modules that enhance the robustness of gait embeddings under viewpoint changes and occlusions. The ST-GAT amplifies the important motion ranges and reduces the influence of noisy poses. The multi-head learning module then preserves beneficial local temporal dynamics of movement. We also boost the discriminative power of person representations by learning body shape cues using a GAT. Experiments on two large-scale VCCRe-ID datasets demonstrate that our proposed framework outperforms state-of-the-art methods by 12.2% in rank-1 accuracy and 7.0% in mAP.
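To make the idea of attention over skeleton joints concrete, the following is a minimal single-head graph attention layer in NumPy, following the generic GAT formulation rather than the paper's actual architecture. The 5-joint chain skeleton, the feature sizes, and the names `gat_layer`, `w`, and `a` are assumptions chosen for illustration only.

```python
import numpy as np

def gat_layer(x, adj, w, a, slope=0.2):
    """Generic single-head graph attention (illustrative, not the paper's code).

    x:   (J, F) per-joint input features, e.g. 3D coordinates
    adj: (J, J) 0/1 adjacency matrix that includes self-loops
    w:   (F, D) shared linear projection
    a:   (2*D,) attention vector applied to concatenated joint pairs
    """
    h = x @ w                                          # project joints: (J, D)
    d = h.shape[1]
    # a^T [h_i || h_j] splits into a source term plus a target term.
    e = (h @ a[:d])[:, None] + (h @ a[d:])[None, :]    # (J, J) raw scores
    e = np.where(e > 0, e, slope * e)                  # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)                  # keep only connected joints
    e = e - e.max(axis=1, keepdims=True)               # stabilise the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # each row sums to 1
    return alpha @ h, alpha

# Hypothetical toy skeleton: 5 joints connected in a chain, 3D coordinates.
rng = np.random.default_rng(0)
J, F, D = 5, 3, 4
x = rng.normal(size=(J, F))
adj = np.eye(J)
for i in range(J - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1
w = rng.normal(size=(F, D))
a = rng.normal(size=(2 * D,))

out, alpha = gat_layer(x, adj, w, a)
print(out.shape, alpha.shape)
```

Each joint's output is a weighted average of its neighbours' projected features, so joints with noisy estimates can receive lower weights once the parameters are learned. A multi-head variant would run several such layers with separate `w` and `a` and combine their outputs.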

Paper Structure (27 sections, 7 equations, 5 figures, 5 tables)

INTRODUCTION
RELATED WORKS
Person Re-ID
Image-based CCRe-ID
Video-based CCRe-ID
PROPOSED FRAMEWORK
Overview
Attention-based Shape and Gait Learning branch
Pose Estimator and Refinement Network
Shape Representation Learning
Gait Representation Learning
Adaptive Fusion Module
EXPERIMENTAL SETUP
Datasets and Evaluation Protocols
Evaluation protocols:
...and 12 more sections

Figures (5)

Figure 1: Overview of the proposed ASGL framework. Given a video sequence, for the ASGL branch, a 3D pose sequence is first estimated and then refined. A GAT in the shape learning sub-branch extracts frame-wise shape features, which are then aggregated into the video-wise shape embedding by a temporal average pooling layer (blue flow). Meanwhile, a spatial-temporal graph is constructed from the refined pose sequence and processed by an ST-GAT to obtain the gait embedding (red flow). Appearance, shape and gait are finally fused by the Adaptive Fusion Module into the final person representation.
Figure 2: Illustration of 3D pose estimation. Pose is first estimated using an off-the-shelf pose estimator, then normalized to a unified view.
Figure 3: Architecture of the proposed Spatial-Temporal Graph Attention Network for encoding skeleton-based gait.
Figure 4: Architecture of the Adaptive Fusion Module.
Figure 5: Samples from VCCR (top) and CCVID (bottom). For VCCR, we randomly collect $3$ tracklets from the same identity under different clothing. For CCVID, we randomly choose $3$ identities with $2$ tracklets each under different clothing. VCCR poses realistic challenges for Re-ID such as entire clothing changes, viewpoint variations, and occlusions, while CCVID contains only frontal images with clearly visible faces, no occlusion, and slight clothing changes, with identities carrying bags.
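The aggregation steps named in the Figure 1 caption can be sketched as follows. Temporal average pooling is shown as described there. The fusion step is only a hypothetical stand-in: the internals of the Adaptive Fusion Module are not given on this page, so `softmax_weighted_fusion`, its `scores` input, and the assumption that all three cues share one embedding size are illustrative choices, not the authors' design.

```python
import numpy as np

def temporal_average_pool(frame_feats):
    """Average (T, D) frame-wise features into one (D,) video-wise embedding."""
    return np.asarray(frame_feats, dtype=float).mean(axis=0)

def softmax_weighted_fusion(feats, scores):
    """Hypothetical fusion: softmax over per-cue scores, then a weighted sum.

    feats:  list of C embeddings, each of shape (D,) (same D is an assumption)
    scores: (C,) unnormalised importance scores, one per cue
    """
    stacked = np.stack(feats)                   # (C, D)
    s = np.asarray(scores, dtype=float)
    weights = np.exp(s - s.max())               # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ stacked, weights

# Toy inputs: 3 frames of 2-D shape features, plus appearance and gait vectors.
frame_shape_feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
shape_emb = temporal_average_pool(frame_shape_feats)   # mean over frames

appearance_emb = np.array([1.0, 0.0])
gait_emb = np.array([0.0, 1.0])

# Equal scores give equal weights, i.e. a plain average of the three cues.
fused, weights = softmax_weighted_fusion(
    [appearance_emb, shape_emb, gait_emb], scores=[0.0, 0.0, 0.0]
)
print(shape_emb, weights, fused)
```

In a learned setting the scores would come from a trainable module conditioned on the input, so the weighting between appearance, shape, and gait could vary per video.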
