GaitSTR: Gait Recognition with Sequential Two-stream Refinement

Wanrong Zheng, Haidong Zhu, Zhaoheng Zheng, Ram Nevatia

TL;DR

Problem: robust gait recognition from walking sequences is challenged by appearance variance induced by clothing and carried objects, and by frame-wise skeleton jitter. Approach: GaitSTR fuses silhouettes with skeletons, refines joints and bones via a skeleton correction network, and uses cross-modal adapters to enable sequential, two-stream refinement guided by silhouette temporal cues. Contributions: a joint+bone skeleton representation, internal skeleton self-correction, silhouette-guided cross-modal correction, and end-to-end training with triplet and classification losses; evaluated on CASIA-B, OU-MVLP, Gait3D, and GREW, achieving state-of-the-art results without extra annotations. Significance: improved robustness to occlusions and appearance changes, enabling more reliable gait-based identification in real-world, long-distance scenarios.

Abstract

Gait recognition aims to identify a person based on their walking sequences, serving as a useful biometric modality as it can be observed from long distances without requiring cooperation from the subject. In representing a person's walking sequence, silhouettes and skeletons are the two primary modalities used. Silhouette sequences lack detailed part information when different body segments overlap, and they are affected by carried objects and clothing. Skeletons, comprising joints and the bones connecting them, provide more accurate part information for different segments; however, they are sensitive to occlusions and low-quality images, causing inconsistencies in frame-wise results within a sequence. In this paper, we explore the use of a two-stream representation of skeletons for gait recognition, alongside silhouettes. By fusing the combined data of silhouettes and skeletons, we refine the two-stream skeletons (joints and bones) through self-correction in graph convolution, along with cross-modal correction with temporal consistency from silhouettes. We demonstrate that with refined skeletons, the performance of the gait recognition model can achieve further improvement on public gait recognition datasets compared with state-of-the-art methods without extra annotations.
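
To make the two-stream skeleton representation concrete, the following is a minimal sketch of how a bone stream could be derived from a joint stream. It assumes a bone is the difference vector between two connected joints, which is a common convention in skeleton-based models but is not confirmed by the text above. The 5-joint chain `EDGES` and the helper names `joints_to_bones` and `sequence_to_two_streams` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Joint = Tuple[float, float]  # 2D joint coordinate (x, y)
Bone = Tuple[float, float]   # 2D bone vector (dx, dy)

# Hypothetical 5-joint chain for illustration only; real pose estimators
# define their own joint order and connectivity.
EDGES: List[Tuple[int, int]] = [(0, 1), (1, 2), (2, 3), (3, 4)]


def joints_to_bones(joints: List[Joint],
                    edges: List[Tuple[int, int]] = EDGES) -> List[Bone]:
    """Return one bone vector (child - parent) per edge.

    Assumption: a bone is the difference between its two endpoint joints.
    """
    return [
        (joints[child][0] - joints[parent][0],
         joints[child][1] - joints[parent][1])
        for parent, child in edges
    ]


def sequence_to_two_streams(seq: List[List[Joint]]):
    """Split a skeleton sequence into a joint stream and a bone stream."""
    joint_stream = [list(frame) for frame in seq]
    bone_stream = [joints_to_bones(frame) for frame in seq]
    return joint_stream, bone_stream


# Two toy frames; joint 2 shifts sideways in the second frame, which
# changes both bones attached to it.
frames = [
    [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (1.0, 2.0), (1.0, 3.0)],
    [(0.0, 0.0), (0.0, 1.0), (0.5, 2.0), (1.0, 2.0), (1.0, 3.0)],
]
J, B = sequence_to_two_streams(frames)
```

In this toy example, a single displaced joint changes the two bones attached to it, which suggests why correcting joints and bones together can be useful when per-frame detections are inconsistent.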

Paper Structure (10 sections, 4 equations, 4 figures, 9 tables)

This paper contains 10 sections, 4 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Visualization of the (a) silhouette and (b) skeleton sequences used for gait recognition. Silhouette contours vary with clothing and carried objects, while skeletons suffer from jittery detection results across video frames.
  • Figure 2: Our proposed architecture for GaitSTR. Trapezoids denote trainable modules, and modules with the same color and fill pattern in the same model share weights. Dashed lines represent feature copying. $S$, $J$, and $B$ are the input silhouettes, joints, and bones, respectively. $F_S$ represents silhouette features, while $F_J$ and $F_{B}$ represent joint and bone features for skeleton representations.
  • Figure 3: Architecture of the skeleton correction network. $F_J$ and $F_B$ represent the joint and bone frame-wise features encoded from $J$ (joints) and $B$ (bones), respectively. The symbol 'C' denotes concatenation, and the plus sign denotes addition. 'Corr' refers to the skeleton correction network, while 'CMA' stands for the layer-level cross-modal adapter, and $k$ denotes the number of layers over which cross-modal skeleton correction operations are repeated between bones and joints.
  • Figure 4: Visualization of successfully and unsuccessfully refined skeletons with GaitSTR. For each example, from left to right, we show the original skeleton, the refined skeleton, and its neighboring frames.