HiH: A Multi-modal Hierarchy in Hierarchy Network for Unconstrained Gait Recognition
Lei Wang, Bo Liu, Yinchi Ma, Fangfang Liang, Nawei Guo
TL;DR
HiH tackles unconstrained gait recognition by fusing silhouette and 2D pose through a dual-branch network that uses a Hierarchical Gait Decomposer (HGD) to capture gait patterns across depth and width. Two pose-guided modules, Deformable Spatial Enhancement (DSE) and Deformable Temporal Alignment (DTA), refine silhouette features in space and time, enabling robust cross-condition recognition. The approach achieves state-of-the-art results on in-the-wild datasets (Gait3D, GREW) and competitive performance in controlled settings (OUMVLP, CASIA-B), while maintaining efficiency. This work advances multi-modal gait analysis by aligning the two modalities and exploiting hierarchical motion structure to generalize to real-world scenarios.
Abstract
Gait recognition has achieved promising advances in controlled settings, yet it struggles in unconstrained environments due to challenges such as view changes, occlusions, and varying walking speeds. Additionally, efforts to fuse multiple modalities often yield limited improvements because of cross-modality incompatibility, particularly in outdoor scenarios. To address these issues, we present a multi-modal Hierarchy in Hierarchy network (HiH) that integrates silhouette and pose sequences for robust gait recognition. HiH features a main branch that uses Hierarchical Gait Decomposer (HGD) modules for depth-wise and intra-module hierarchical examination of general gait patterns from silhouette data. This approach captures motion hierarchies from overall body dynamics to detailed limb movements, facilitating the representation of gait attributes across multiple spatial resolutions. Complementing this, an auxiliary branch, based on 2D joint sequences, enriches the spatial and temporal aspects of gait analysis. It employs a Deformable Spatial Enhancement (DSE) module for pose-guided spatial attention and a Deformable Temporal Alignment (DTA) module for aligning motion dynamics through learned temporal offsets. Extensive evaluations across diverse indoor and outdoor datasets demonstrate HiH's state-of-the-art performance and a well-balanced trade-off between accuracy and efficiency.
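As a rough illustration of aligning a feature sequence through temporal offsets, the sketch below resamples each frame at a fractional time position using linear interpolation. It is not the authors' DTA implementation: the function name `temporal_align` is hypothetical, the offsets are passed in as fixed numbers rather than predicted by a network from pose features, and each frame is reduced to a short list of floats.

```python
from typing import List, Sequence


def temporal_align(frames: Sequence[Sequence[float]],
                   offsets: Sequence[float]) -> List[List[float]]:
    """Resample time step t at the fractional position t + offsets[t],
    interpolating linearly between the two neighbouring frames.

    Assumption: offsets are given; in a learned setting they would be
    predicted by a network (here, hypothetically, from pose features).
    """
    if len(frames) != len(offsets):
        raise ValueError("need exactly one offset per frame")
    last = len(frames) - 1
    aligned = []
    for t, delta in enumerate(offsets):
        # Clamp the sampling position to the valid range [0, last].
        pos = min(max(t + delta, 0.0), float(last))
        lo = int(pos)            # frame at or just before the position
        hi = min(lo + 1, last)   # next frame, clamped at the sequence end
        w = pos - lo             # interpolation weight of the later frame
        aligned.append([(1.0 - w) * a + w * b
                        for a, b in zip(frames[lo], frames[hi])])
    return aligned


# Toy sequence: 4 frames, 2 feature channels each.
frames = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
# Illustrative offsets: look ahead, stay, look back, overshoot (clamped).
offsets = [0.5, 0.0, -0.5, 1.0]
aligned = temporal_align(frames, offsets)
print(aligned)
```

Because the sampling position is a continuous function of the offset, the interpolation is differentiable with respect to it, which is what would allow such offsets to be trained end to end in a deep learning framework.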
