GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Guozhen Peng, Yunhong Wang, Yuwei Zhao, Shaoxiong Zhang, Annan Li

TL;DR

GLGait tackles gait recognition in unconstrained environments by addressing long-range temporal modeling with a Global-Local Temporal Module (GLTM) that combines Pseudo Global Temporal Self-Attention (PGTA) and temporal convolution, embedded in GL-3D blocks with a 2D vision backbone. It further strengthens learning with Center-Augmented Triplet Loss (CTL), which uses class centers as positives to reduce intra-class variance and increase positive samples. Empirically, GLGait achieves state-of-the-art results on in-the-wild datasets Gait3D and GREW, offering notable gains on long sequences while maintaining memory efficiency relative to full MHSA-based transformers. The approach provides a scalable, practical framework for robust gait recognition in real-world surveillance settings.
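The idea of treating class centers as extra positives can be illustrated with a small sketch. This summary does not give the exact CTL formula, so the version below is only one plausible reading: a standard margin-based triplet loss in which each anchor's positive set is extended with the center of its own class. The function names, the Euclidean distance, the margin value, and the plain averaging over triplets are all assumptions made for illustration, not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_augmented_triplet_loss(features, labels, centers, margin=0.2):
    """Hypothetical sketch of a triplet loss with class centers as positives.

    features: list of feature vectors (one per sample)
    labels:   list of class ids, aligned with `features`
    centers:  dict mapping class id -> center vector (assumed to be given)
    margin:   triplet margin (assumed value, not taken from the paper)
    """
    losses = []
    for i, (anchor, y) in enumerate(zip(features, labels)):
        # Ordinary positives: other samples that share the anchor's label.
        positives = [f for j, (f, l) in enumerate(zip(features, labels))
                     if l == y and j != i]
        # Augmentation: the class center is added as one more positive.
        positives.append(centers[y])
        negatives = [f for f, l in zip(features, labels) if l != y]
        for p in positives:
            for n in negatives:
                d_ap = euclidean(anchor, p)
                d_an = euclidean(anchor, n)
                losses.append(max(0.0, margin + d_ap - d_an))
    # Average over all triplets; 0.0 if the batch yields no triplet.
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    feats = [[0.0, 0.0], [1.0, 0.0]]       # one sample per class
    labs = [0, 1]
    ctrs = {0: [0.0, 2.0], 1: [1.0, 0.0]}  # class 0 sample is far from its center
    print(center_augmented_triplet_loss(feats, labs, ctrs))
```

In the usage example each class has a single sample, so an ordinary triplet loss would have no positive pairs at all. With the centers added, every anchor still forms triplets, which is one way to read the claim that CTL increases the number of positive samples.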

Abstract

Gait recognition has attracted increasing attention from academia and industry as a technology for recognizing people at a distance, non-intrusively and without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolutional Neural Network (ConvNet) based methods have been proposed to address gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. If convolution blocks are directly replaced with visual transformer blocks, the model may not enhance the local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field; the module mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with lower memory and computational complexity than multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. In addition, it can aggregate the pseudo global temporal receptive field into a true holistic temporal receptive field. Furthermore, we propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, i.e., Gait3D and GREW. The code is available at https://github.com/bgdpgz/GLGait.
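The claim that convolutions give a limited temporal receptive field on long sequences can be made concrete with the standard receptive-field recurrence for stacked 1D convolutions. The sketch below is generic arithmetic, not GLGait code. The layer configuration is hypothetical, and the sequence length of 474 frames is borrowed from the Gait3D sequence mentioned in the Figure 5 caption.

```python
def temporal_receptive_field(layers):
    """Receptive field (in frames) of a stack of temporal convolutions.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    Uses the usual recurrence: each layer adds (kernel_size - 1) * jump frames,
    where jump is the product of the strides of all earlier layers.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    # Hypothetical backbone: four temporal convolutions, kernel 3, stride 1.
    stack = [(3, 1)] * 4
    rf = temporal_receptive_field(stack)
    seq_len = 474  # length of the Gait3D sequence cited in the Figure 5 caption
    print(f"receptive field: {rf} frames "
          f"({100 * rf / seq_len:.1f}% of a {seq_len}-frame sequence)")
```

Under these assumptions, each output position sees only 9 of the 474 frames, which shows why a purely convolutional temporal path may miss sparsely distributed gait cycles unless it is paired with some form of global temporal modeling.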

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparison of local and global temporal receptive fields. (a) In laboratory scenarios (Pedestrian #1), gait cycles are evenly distributed, so a properly sized local receptive field can capture a complete cycle. In the wild (Pedestrian #2), the distribution is sparse and random, which implies that a larger receptive field is necessary. The corresponding sequences are sampled from CASIA-B [yu2006framework] and Gait3D [zheng2022gait], respectively. (b) Sequence length and temporal receptive field (TRF) statistics in CASIA-B and Gait3D.
  • Figure 2: Pipeline of the proposed GLGait. The backbone mainly consists of the vision encoder and GL-3D blocks. Specifically, in the Global-Local Temporal Module (GLTM), we use Pseudo Global Temporal Self-Attention (PGTA) to extract global temporal information and a temporal convolution operation to enhance local temporal information extraction. TP denotes the Temporal Max Pooling operation, HP is the Horizontal Pooling operation [fu2019horizontal, fan2023opengait], FC denotes the separate fully connected layers [chao2019gaitset], and BNN is BNNeck [luo2019bag]. The final loss function is composed of a center-augmented triplet loss (CTL) and a cross-entropy loss.
  • Figure 3: Pseudo Global Temporal Self-Attention (PGTA) with a temporal convolution operation (T-Conv).
  • Figure 4: Comparison of triplet loss [HermansBL17] (a) and the proposed center-augmented triplet loss (b). $\textbf{w}_{i}$ is the class center in BNNeck [luo2019bag], $\textbf{x}_{i}$ is the sample feature, the dashed lines in the circle are class boundaries, and $\Rightarrow$ is the gradient of the feature.
  • Figure 5: Silhouette score in Temporal Max Pooling phase, where the sequence contains 474 silhouettes from Gait3D.
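The TP and HP steps named in the Figure 2 caption can be sketched on a toy single-channel feature sequence. This is a simplified illustration, not the GLGait or OpenGait code: it uses nested Python lists instead of tensors, and combining each horizontal strip as max plus mean is an assumption about how the strips are summarized. The function names are hypothetical.

```python
def temporal_max_pooling(seq):
    """Collapse T frames of H x W maps into one H x W map (elementwise max over time)."""
    n_frames, height, width = len(seq), len(seq[0]), len(seq[0][0])
    return [[max(seq[t][h][w] for t in range(n_frames)) for w in range(width)]
            for h in range(height)]


def horizontal_pooling(fmap, parts):
    """Split an H x W map into `parts` horizontal strips and summarize each strip.

    Each strip is reduced to max + mean of its values (assumed reduction).
    """
    height = len(fmap)
    if height % parts != 0:
        raise ValueError("height must be divisible by the number of parts")
    strip = height // parts
    pooled = []
    for p in range(parts):
        values = [v for row in fmap[p * strip:(p + 1) * strip] for v in row]
        pooled.append(max(values) + sum(values) / len(values))
    return pooled


if __name__ == "__main__":
    # Two frames, each a 2 x 2 single-channel feature map.
    sequence = [
        [[0, 1], [2, 0]],
        [[3, 0], [0, 0]],
    ]
    set_level = temporal_max_pooling(sequence)       # one map for the whole sequence
    part_features = horizontal_pooling(set_level, parts=2)
    print(set_level, part_features)
```

Because the temporal maximum is taken per position, the output has the same size regardless of how many frames the sequence contains, so sequences of different lengths map to descriptors of equal size.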