GaitASMS: Gait Recognition by Adaptive Structured Spatial Representation and Multi-Scale Temporal Aggregation

Yan Sun, Hu Long, Xueling Feng, Mark Nixon

TL;DR

GaitASMS addresses occlusion and view variation in gait recognition by combining an Adaptive Structured Representation Extraction (ASRE) module, which pairs edge-aware local features with a Global Feature Extractor, and a Multi-Scale Temporal Aggregation (MSTA) module, which models long- and short-range temporal information using dilated 3D convolutions. A novel random-mask augmentation improves the model's robustness to occlusion. On CASIA-B and OU-MVLP, GaitASMS achieves state-of-the-art or competitive performance. Extensive ablations confirm the effectiveness of ASRE, MSTA, and the random-mask strategy, and show that the components transfer well to related architectures such as GaitGL.
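To make the idea of long- and short-range temporal modeling with dilated convolutions more concrete, here is a minimal NumPy sketch. It is a simplification under stated assumptions: the paper uses dilated 3D convolutions, whereas this sketch convolves only along the time axis, uses one fixed kernel shared across channels, and sums parallel residual branches before a temporal max. The function names, the dilation rates (1, 2, 4), and the branch arrangement are illustrative choices, not the authors' design.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """Convolve along time with a dilated kernel, keeping the sequence length.

    x: array of shape (T, C) holding per-frame features.
    kernel: array of shape (K,) with K odd, shared across channels (assumption).
    dilation: gap between kernel taps, in frames.
    """
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero padding on the time axis only
    out = np.zeros(x.shape, dtype=float)
    for i, w in enumerate(kernel):
        # Tap i looks (i - (k - 1) / 2) * dilation frames away from the centre.
        out += w * xp[i * dilation : i * dilation + x.shape[0]]
    return out

def multi_scale_aggregate(x, kernel, dilations=(1, 2, 4)):
    """Add one ReLU branch per dilation to the input, then max-pool over time."""
    y = x.astype(float)
    for d in dilations:
        y = y + np.maximum(dilated_temporal_conv(x, kernel, d), 0.0)
    return y.max(axis=0)  # one feature vector of shape (C,) per sequence

# An impulse at frame 4 shows which frames a 3-tap kernel with dilation 2 reaches.
x = np.zeros((9, 1))
x[4, 0] = 1.0
response = dilated_temporal_conv(x, np.array([1.0, 1.0, 1.0]), dilation=2)
```

With a 3-tap kernel, a dilation of d spans 2d + 1 frames, so small dilations capture short-range motion and larger ones reach further along the sequence without adding parameters.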

Abstract

Gait recognition is one of the most promising video-based biometric technologies. The edges of silhouettes and motion are the most informative features, and previous studies have explored them separately and achieved notable results. However, due to occlusions and variations in viewing angle, the recognition performance of these methods is often affected by the predefined spatial segmentation strategy. Moreover, traditional temporal pooling usually neglects distinctive temporal information in gait. To address these issues, we propose a novel gait recognition framework, denoted GaitASMS, which can effectively extract adaptive structured spatial representations and naturally aggregate multi-scale temporal information. The Adaptive Structured Representation Extraction Module (ASRE) separates the edges of silhouettes by using an adaptive edge mask and maximizes the representation in semantic latent space. The Multi-Scale Temporal Aggregation Module (MSTA) achieves effective modeling of long- and short-range temporal information through a temporally aggregated structure. Furthermore, we propose a new data augmentation, denoted random mask, to enrich the sample space of long-term occlusion and enhance the generalization of the model. Extensive experiments conducted on two datasets demonstrate the competitive advantage of the proposed method, especially in complex scenes, i.e., BG and CL. On the CASIA-B dataset, GaitASMS achieves an average accuracy of 93.5% and outperforms the baseline in rank-1 accuracy by 3.4% and 6.3% in BG and CL, respectively. The ablation experiments demonstrate the effectiveness of ASRE and MSTA. The source code is available at https://github.com/YanSungithub/GaitASMS.
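As a rough illustration of the random-mask idea, the sketch below blanks the same randomly placed rectangle in every frame of a silhouette sequence, which imitates an occluder that persists over time. The abstract does not specify the mask shape, size, or placement, so the rectangle, the `max_frac` parameter, and the function name `random_mask` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def random_mask(seq, max_frac=0.3, rng=None):
    """Zero out one random rectangle at the same location in every frame.

    seq: array of shape (T, H, W) holding binary silhouettes.
    max_frac: hypothetical upper bound on the rectangle's height and width,
              as a fraction of H and W.
    rng: optional numpy Generator, for reproducibility.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = seq.shape
    # Rectangle size: at least 1 pixel, at most max_frac of each dimension.
    mh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    mw = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    # Top-left corner chosen so the rectangle stays inside the frame.
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))
    out = seq.copy()  # leave the input sequence untouched
    out[:, top:top + mh, left:left + mw] = 0
    return out

# Example: a toy sequence of 8 all-foreground frames of size 64 x 44.
seq = np.ones((8, 64, 44), dtype=np.uint8)
masked = random_mask(seq, rng=np.random.default_rng(0))
```

Applying the mask identically across frames is what distinguishes this from per-frame erasing: the occluded region stays fixed for the whole sequence, matching the long-term occlusion the abstract refers to.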

Paper Structure (14 sections, 16 equations, 5 figures, 8 tables)

This paper contains 14 sections, 16 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The overview of our GaitASMS. "ASRE" represents the Adaptive Structured Representation Extraction Module. "Addition" means the element-wise addition of local features and global features. "Cat" indicates combining local features and global features in the H dimension. "MSTA" represents the Multi-Scale Temporal Feature Aggregation Module, which is composed of the dilated convolution residual blocks. "HPP" means horizontal pyramid pooling [27].
  • Figure 2: Overview of the ASRE. LEM is the Local Feature Extractor Based on Edge Mask. GFE is the Global Feature Extractor.
  • Figure 3: Operation of the Edge Mask.
  • Figure 4: Overview of the DCB (Dilated Convolution Block), which is composed of dilated 3D convolution layers, ReLU, and BatchNorm.
  • Figure 5: Visualization of the heatmaps for different layers in GaitASMS on CASIA-B. (a) Top: the sequence of silhouettes; (b) Middle: the heatmaps of ASRE-1; (c) Bottom: the heatmaps of the MSTA. The red boxes mark silhouettes and heatmaps with self-occlusion. The blue boxes mark frames and heatmaps with missing hand contours.
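
The captions for Figures 2 and 3 mention an edge mask that separates silhouette edges for local feature extraction, but the operation itself is not described in this summary. One plausible way to obtain such a mask, shown here purely as an assumption, is a morphological gradient: dilate and erode the binary silhouette with a 3x3 neighbourhood and keep the pixels where the two differ. The function names and the fixed 3x3 neighbourhood are hypothetical, and the adaptive aspect of the paper's mask is not modeled.

```python
import numpy as np

def _neighbourhood_stack(img):
    """Stack the 9 zero-padded shifts of img covering a 3x3 neighbourhood."""
    h, w = img.shape
    p = np.pad(img, 1)  # pads a boolean image with False
    return np.stack([p[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def edge_mask(sil):
    """Boolean edge band of a binary silhouette: dilation minus erosion."""
    s = _neighbourhood_stack(sil.astype(bool))
    dilated = s.any(axis=0)  # True if any neighbour is foreground
    eroded = s.all(axis=0)   # True only if all neighbours are foreground
    return dilated & ~eroded

def split_edge_interior(feat, sil):
    """Split a feature map of shape (H, W) into edge and non-edge parts."""
    m = edge_mask(sil)
    return feat * m, feat * ~m

# Example: a 3x3 square inside a 7x7 frame.
sil = np.zeros((7, 7), dtype=np.uint8)
sil[2:5, 2:5] = 1
edges = edge_mask(sil)
edge_part, rest_part = split_edge_interior(np.ones((7, 7)), sil)
```

The two parts returned by `split_edge_interior` sum back to the original feature map, so the split routes information to different branches without discarding any of it.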