CLASH: Complementary Learning with Neural Architecture Search for Gait Recognition

Huanzhang Dou, Pengyi Zhang, Yuhan Zhao, Lu Jin, Xi Li

TL;DR

This work tackles gait recognition by addressing the limitations of sparse silhouette representations. It introduces DSTF, a dense spatial-temporal texture derived from a bidirectional distance transform, and combines it with silhouette features through NAS-driven complementary learning (NCL) via a multi-descriptor cell. The framework achieves state-of-the-art performance across in-the-lab and in-the-wild datasets, notably improving robustness and accuracy in challenging cross-view and unconstrained conditions. The approach demonstrates the effectiveness of dense texture representations and automated fusion architecture for gait analysis, with practical implications for surveillance and biometric systems.

Abstract

Gait recognition, which aims to identify individuals by their walking patterns, has achieved great success with silhouette-based methods. The binary silhouette sequence encodes the walking pattern within a sparse boundary representation. Consequently, most pixels in the silhouette are under-sensitive to the walking pattern, since the sparse boundary lacks the dense spatial-temporal information that a dense texture is well suited to represent. To enhance the sensitivity to the walking pattern while maintaining the robustness of recognition, we present a Complementary Learning with neural Architecture Search (CLASH) framework, consisting of a walking-pattern-sensitive gait descriptor named the dense spatial-temporal field (DSTF) and neural architecture search based complementary learning (NCL). Specifically, DSTF transforms the representation from the sparse binary boundary into a dense distance-based texture, which is sensitive to the walking pattern at the pixel level. Further, NCL presents a task-specific search space for complementary learning, in which the sensitivity of DSTF and the robustness of the silhouette complement each other to represent the walking pattern effectively. Extensive experiments demonstrate the effectiveness of the proposed methods under both in-the-lab and in-the-wild scenarios. On CASIA-B, we achieve rank-1 accuracy of 98.8%, 96.5%, and 89.3% under three conditions. On OU-MVLP, we achieve rank-1 accuracy of 91.9%. On the latest in-the-wild datasets, we outperform the latest silhouette-based methods by 16.3% and 19.7% on Gait3D and GREW, respectively.
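
To make the sparse-versus-dense argument concrete, the following is a minimal sketch of a per-frame signed (bidirectional) distance field, assuming foreground pixels take the positive distance to the nearest background pixel and background pixels take the negative distance to the nearest foreground pixel. The function names `signed_distance_field` and `changed_fraction`, the brute-force search, the toy frames, and the absence of any normalization are all illustrative assumptions; the abstract does not specify the exact DSTF formulation.

```python
import math


def signed_distance_field(mask):
    """Brute-force signed distance field of a binary mask (illustrative only).

    mask: list of rows, 1 = foreground (body), 0 = background.
    Foreground pixels get +distance to the nearest background pixel,
    background pixels get -distance to the nearest foreground pixel.
    Assumption: no normalization; the paper's exact DSTF may differ.
    """
    fg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    bg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if not v]
    if not fg or not bg:
        raise ValueError("mask must contain both foreground and background")
    field = []
    for r, row in enumerate(mask):
        out_row = []
        for c, v in enumerate(row):
            # Search the opposite class; the sign encodes which side we are on.
            targets, sign = (bg, 1.0) if v else (fg, -1.0)
            dist = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
            out_row.append(sign * dist)
        field.append(out_row)
    return field


def changed_fraction(a, b, eps=1e-9):
    """Fraction of pixels whose value differs between two same-shaped frames."""
    total = sum(len(row) for row in a)
    changed = sum(
        1 for ra, rb in zip(a, b) for va, vb in zip(ra, rb) if abs(va - vb) > eps
    )
    return changed / total


# Toy "silhouettes": a 3x3 blob that moves one pixel to the right.
frame_t0 = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
frame_t1 = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

field_t0 = signed_distance_field(frame_t0)
field_t1 = signed_distance_field(frame_t1)

print("silhouette pixels changed:", changed_fraction(frame_t0, frame_t1))
print("distance-field pixels changed:", changed_fraction(field_t0, field_t1))
```

In this toy case the binary frames differ only on the 6 of 35 pixels along the moving edges, while the distance fields differ on 27 of 35 pixels, which is the pixel-level sensitivity the abstract refers to. The brute-force search is quadratic in the number of pixels; a real pipeline would more likely use an optimized routine such as `scipy.ndimage.distance_transform_edt`.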

Paper Structure (16 sections, 9 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Comparison between the silhouette and the proposed DSTF. The pixels of interest computed by GEnI [5522296] are the set of pixels that contribute most to the walking pattern. The walking patterns of pixels on the foreground, boundary, and background are reflected in the value changes of the corresponding pixels. Pixels shown in blue on the DSTF have negative values.
  • Figure 2: Overview of the CLASH framework. Top: The features of the silhouette and DSTF are extracted by the feature extractor. Then, complementary learning for the two heterogeneous descriptors is conducted through neural architecture search, i.e., the multi-descriptor (MD) cell. The final feature is obtained by temporal aggregation and a linear layer [chao2019gaitset]. Bottom: The features of the silhouette stream and the DSTF stream mutually complement each other, i.e., regularize and densify. Note that the arrow between the DSTF and the silhouette denotes their interaction, not a transformation of one into the other.
  • Figure 3: Comparison between silhouette, foreground of Bi-DT (Fore-DT), background of Bi-DT (Back-DT), and a single frame of DSTF. The background is marked with blue for visualization of negative pixel values.
  • Figure 4: Motion information, i.e., frame difference comparison. Most pixels in a single DSTF frame change over time and therefore carry temporal information, whereas most pixels in the silhouette do not change over time.
  • Figure 5: Illustration of the cell topology. A and B represent the features of the two gait descriptors and are assigned to the two input nodes Input X and Input Y, respectively. The two input nodes, together with two intermediate nodes and one output node, form the whole topology of the cell.
  • ...and 5 more figures
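
The cell topology described in the Figure 5 caption (two input nodes, two intermediate nodes, one output node) can be sketched as a small directed graph. The sketch below assumes a DARTS-style continuous relaxation in which every edge applies a softmax-weighted mixture of candidate operations. The candidate set in `OPS`, the edge layout, the elementwise-sum output, and the names `mixed_op` and `md_cell` are hypothetical choices made for illustration; this section does not state how the actual search space or search procedure is defined.

```python
import math

# Hypothetical candidate operations on a feature vector (list of floats).
OPS = {
    "zero": lambda x: [0.0 for _ in x],
    "identity": lambda x: list(x),
    "relu": lambda x: [max(0.0, v) for v in x],
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alpha):
    """Softmax-weighted sum of every candidate operation applied to x."""
    out = [0.0] * len(x)
    for w, op in zip(softmax(alpha), OPS.values()):
        out = [o + w * v for o, v in zip(out, op(x))]
    return out


def add(*vectors):
    """Elementwise sum of same-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def md_cell(feat_a, feat_b, alphas):
    """Two input nodes -> two intermediate nodes -> one output node.

    alphas maps an edge (src, dst) to one logit per candidate operation.
    Assumption: each intermediate node sums mixed ops over all earlier
    nodes, and the output node sums the two intermediate nodes.
    """
    x, y = feat_a, feat_b  # Input X and Input Y
    n0 = add(mixed_op(x, alphas[("x", "n0")]), mixed_op(y, alphas[("y", "n0")]))
    n1 = add(
        mixed_op(x, alphas[("x", "n1")]),
        mixed_op(y, alphas[("y", "n1")]),
        mixed_op(n0, alphas[("n0", "n1")]),
    )
    return add(n0, n1)


# Uniform architecture logits: every operation weighted equally on every edge.
edges = [("x", "n0"), ("y", "n0"), ("x", "n1"), ("y", "n1"), ("n0", "n1")]
alphas = {edge: [0.0, 0.0, 0.0] for edge in edges}

silhouette_feat = [1.0, -2.0]  # stands in for descriptor A
dstf_feat = [3.0, 4.0]         # stands in for descriptor B
fused = md_cell(silhouette_feat, dstf_feat, alphas)
print(fused)
```

In a search setting of this kind, the logits in `alphas` would be learned alongside the network weights and the strongest operation on each edge kept afterwards; here they are fixed to uniform values purely to show the data flow between the two descriptor streams.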