Combo-Gait: Unified Transformer Framework for Multi-Modal Gait Recognition and Attribute Analysis
Zhao-Yang Wang, Zhimin Shao, Jieneng Chen, Rama Chellappa
TL;DR
Combo-Gait tackles gait recognition under unconstrained conditions by unifying 2D silhouette cues with 3D SMPL geometry in a single transformer-based fusion framework. It goes beyond identity recognition to jointly estimate human attributes (age, BMI, and gender) through a multitask design that leverages shared representations. Experiments on the challenging BRIAR datasets show state-of-the-art gait recognition and accurate attribute estimation, with strong robustness at long distances (up to 1 km) and extreme pitch angles (up to 50°). The work demonstrates the value of multi-modal feature fusion and joint attribute learning for real-world, long-range biometric analysis.
Abstract
Gait recognition is an important biometric for human identification at a distance, particularly in low-resolution or unconstrained environments. Existing work typically focuses on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPL models), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal, multitask framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer fuses the multi-modal gait features and learns attribute-related representations while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
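The two ideas in the abstract, fusing silhouette and SMPL features in one transformer and training identity and attribute tasks together, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture or the loss, so the sketch assumes single-head self-attention with identity projections over the concatenated token sequence, mean pooling into a shared embedding, and a weighted sum of per-task losses. The function names, the toy token values, and the `attr_weight` hyperparameter are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: Q, K, and V are the tokens themselves (no learned projections).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def fuse(silhouette_tokens, smpl_tokens):
    # Put both modalities in one sequence so every token can attend to every
    # other token, then mean-pool into a single shared embedding.
    tokens = silhouette_tokens + smpl_tokens
    attended = self_attention(tokens)
    n = len(attended)
    return [sum(t[j] for t in attended) / n for j in range(len(attended[0]))]

def multitask_loss(id_loss, age_loss, bmi_loss, gender_loss, attr_weight=0.5):
    # Assumed objective: identity loss plus a weighted sum of attribute losses.
    return id_loss + attr_weight * (age_loss + bmi_loss + gender_loss)

# Toy inputs: two silhouette tokens and one SMPL token, each of dimension 2.
silhouette_tokens = [[1.0, 0.0], [0.0, 1.0]]
smpl_tokens = [[1.0, 1.0]]
embedding = fuse(silhouette_tokens, smpl_tokens)
total = multitask_loss(id_loss=1.0, age_loss=0.2, bmi_loss=0.4, gender_loss=0.4)
```

In a design like this, the identity head and the attribute heads would all read from the same pooled embedding, which is one way a shared representation could support both tasks; a practical model would use learned projections, multiple heads and layers, and per-modality encoders upstream of the fusion step.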
