Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis

Zhao-Yang Wang, Zhimin Shao, Jieneng Chen, Rama Chellappa

TL;DR

Combo-Gait tackles gait recognition under unconstrained conditions by unifying 2D silhouette cues with 3D SMPL geometry in a single transformer-based fusion framework. It advances beyond identity recognition to jointly estimate human attributes (Age, BMI, Gender) through a multitask design that leverages shared representations. Empirical results on the challenging BRIAR datasets show state-of-the-art performance in gait recognition and attribute accuracy, with strong robustness across long distances and varied viewpoints. The work demonstrates the value of multi-modal feature fusion and concurrent attribute learning for real-world, long-range biometric analysis, offering a scalable approach for surveillance and identification tasks.

Abstract

Gait recognition is an important biometric for human identification at a distance, particularly in low-resolution or unconstrained environments. Existing work typically focuses on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPL models), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to fuse multi-modal gait features and better learn attribute-related representations while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
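
The abstract describes joint training for identity recognition and attribute estimation but does not state the objective. As a rough illustration only, the sketch below combines one identity term with three attribute terms into a single weighted loss. The choice of cross-entropy for identity, L1 error for age and BMI, binary cross-entropy for gender, the function names, and the weights are all assumptions made for this example and are not taken from the paper.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one sample from raw identity logits (stable log-sum-exp)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_index]

def binary_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy on a single logit; label is 0 or 1."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def multitask_loss(id_logits, id_target, age_pred, age_true,
                   bmi_pred, bmi_true, gender_logit, gender_true,
                   weights=(1.0, 0.1, 0.1, 0.1)):
    """Hypothetical weighted sum of one identity loss and three attribute losses.

    The weights are placeholders chosen for illustration, not values from the paper.
    """
    w_id, w_age, w_bmi, w_gender = weights
    terms = {
        "identity": softmax_cross_entropy(id_logits, id_target),
        "age": abs(age_pred - age_true),   # L1 regression error (assumed)
        "bmi": abs(bmi_pred - bmi_true),   # L1 regression error (assumed)
        "gender": binary_cross_entropy(gender_logit, gender_true),
    }
    total = (w_id * terms["identity"] + w_age * terms["age"]
             + w_bmi * terms["bmi"] + w_gender * terms["gender"])
    return total, terms

# Example call with made-up predictions for a single sample.
total, terms = multitask_loss(
    id_logits=[0.0, 0.0], id_target=0,
    age_pred=30.0, age_true=32.0,
    bmi_pred=22.5, bmi_true=22.0,
    gender_logit=0.0, gender_true=1,
)
```

Returning the per-task terms alongside the total makes it easy to check whether one task dominates the shared representation, which is the usual reason for tuning such weights.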

Paper Structure

This paper contains 23 sections, 16 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: An example of different gait representations with human attribute information. Human attributes such as age, BMI, and gender influence a subject's walking pattern and shape.
  • Figure 2: The pipeline of the Combo-Gait framework. (1) Video Segmentation and Reconstruction; (2) Multimodal Gait Feature Extraction and Fusion; (3) Gait Feature and Human Attribute Fusion; (4) Gait Recognition and Human Attribute Estimation Execution.
  • Figure 3: Examples of two subjects under various conditions from the BRIAR dataset. At longer distances, significant turbulence and noise degrade the image quality. The 3D SMPL parameters corresponding to the two subjects are presented in the second and fourth rows.
  • Figure 4: Visualization of the complementarity between silhouettes and 3D SMPL parameters.