HiH: A Multi-modal Hierarchy in Hierarchy Network for Unconstrained Gait Recognition

Lei Wang; Bo Liu; Yinchi Ma; Fangfang Liang; Nawei Guo

TL;DR

HiH tackles unconstrained gait recognition by fusing silhouette and 2D pose through a dual-branch network that leverages a Hierarchical Gait Decomposer to capture gait patterns across depth and width. Pose-guided modules, Deformable Spatial Enhancement and Deformable Temporal Alignment, refine silhouette features in space and time, enabling robust cross-condition recognition. The approach achieves state-of-the-art results on wild datasets (Gait3D, GREW) and competitive performance in controlled settings (OUMVLP, CASIA-B), while maintaining efficiency. This work advances multi-modal gait analysis by effectively aligning modalities and exploiting hierarchical motion structure for generalization in real-world scenarios.

Abstract

Gait recognition has achieved promising advances in controlled settings, yet it struggles significantly in unconstrained environments due to challenges such as view changes, occlusions, and varying walking speeds. Additionally, efforts to fuse multiple modalities often yield limited improvements because of cross-modality incompatibility, particularly in outdoor scenarios. To address these issues, we present a multi-modal Hierarchy in Hierarchy network (HiH) that integrates silhouette and pose sequences for robust gait recognition. HiH features a main branch that utilizes Hierarchical Gait Decomposer (HGD) modules for depth-wise and intra-module hierarchical examination of general gait patterns from silhouette data. This approach captures motion hierarchies from overall body dynamics to detailed limb movements, facilitating the representation of gait attributes across multiple spatial resolutions. Complementing this, an auxiliary branch, based on 2D joint sequences, enriches the spatial and temporal aspects of gait analysis. It employs a Deformable Spatial Enhancement (DSE) module for pose-guided spatial attention and a Deformable Temporal Alignment (DTA) module for aligning motion dynamics through learned temporal offsets. Extensive evaluations across diverse indoor and outdoor datasets demonstrate HiH's state-of-the-art performance, affirming a well-balanced trade-off between accuracy and efficiency.
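The idea of aligning motion through temporal offsets can be illustrated with a minimal, dependency-free sketch. The abstract does not give the DTA formulation, so everything here is an assumption made for illustration: the helper name `sample_with_offsets`, scalar per-frame features, offsets supplied by hand rather than predicted from pose, linear interpolation between neighbouring frames, and clamping at the sequence boundaries.

```python
import math
from typing import List, Sequence


def sample_with_offsets(
    frames: Sequence[float], offsets: Sequence[float], stride: int = 1
) -> List[float]:
    """Resample a 1D feature sequence at positions t + offsets[t] (hypothetical helper).

    Fractional positions are linearly interpolated between the two nearest
    frames and clamped to the valid range [0, len(frames) - 1]. A stride > 1
    keeps every `stride`-th output, i.e. it also downsamples in time.
    """
    if len(frames) != len(offsets):
        raise ValueError("frames and offsets must have the same length")
    last = len(frames) - 1
    out = []
    for t in range(0, len(frames), stride):
        # Shift the sampling position by the (assumed given) offset, then clamp.
        pos = min(max(t + offsets[t], 0.0), float(last))
        lo = math.floor(pos)
        hi = min(lo + 1, last)
        w = pos - lo  # interpolation weight of the later frame
        out.append((1.0 - w) * frames[lo] + w * frames[hi])
    return out


frames = [0.0, 10.0, 20.0, 30.0]
offsets = [0.5, 0.0, -0.5, 1.0]
aligned = sample_with_offsets(frames, offsets)
downsampled = sample_with_offsets(frames, offsets, stride=2)
```

In a trained network the offsets would be produced by a learned layer and applied to multi-channel feature maps; the sketch only shows the sampling step that makes such offsets differentiable through interpolation.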

Paper Structure (18 sections, 9 equations, 7 figures, 6 tables)

Introduction
Related Work
Single-modal Gait Recognition
Multi-modal Gait Recognition
Method
Framework Overview
Hierarchical Gait Decomposer (HGD)
Spatially Enhanced HGD (SE-HGD)
Temporally Enhanced HGD (TE-HGD)
Loss Function
Experiments
Datasets
Implementation Details
Comparison with State-of-the-Art Methods
Ablation Study
...and 3 more sections

Figures (7)

Figure 1: Motivation of the proposed HiH approach. Left: Performance degradation from controlled to uncontrolled scenarios. Right: Overview of HiH’s multi-modal fusion of silhouette and 2D keypoint sequences through pose-guided spatio-temporal processing.
Figure 2: Overview of the HiH Framework. HiH takes silhouette sequence $X_{\text{sil}}$ and pose sequence $X_{\text{pose}}$ as inputs. The main branch uses multiple Hierarchical Gait Decomposers (HGDs) to extract general gait motion patterns in both depth and width. The auxiliary branch enhances HGDs through pose-guided Deformable Spatial Enhancement (DSE) and Deformable Temporal Alignment (DTA) modules, where DTA also performs temporal downsampling with stride $t$. Integrated outputs from both branches undergo Temporal Pooling (TP) and Horizontal Pooling (HP), and are then transformed into gait embeddings through fully-connected layers.
Figure 3: The architecture of the Hierarchical Gait Decomposer (HGD), where the number in parentheses denotes the number of horizontal splits.
Figure 4: The detailed structure of the Deformable Spatial Enhancement (DSE) module.
Figure 5: The detailed structure of the Deformable Temporal Alignment module (DTA).
...and 2 more figures
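
The Figure 2 caption names Temporal Pooling (TP) and Horizontal Pooling (HP) without specifying the operators. The following is a rough sketch under stated assumptions, not the authors' implementation: TP is taken as an element-wise max over frames, HP as a split into equal horizontal strips with each strip pooled as max plus mean, and the feature map is reduced to a single channel stored as nested lists. The names `temporal_pooling` and `horizontal_pooling` are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # one single-channel feature map, indexed [row][col]


def temporal_pooling(seq: List[Frame]) -> Frame:
    """Element-wise max over the time axis (assumed pooling operator)."""
    rows, cols = len(seq[0]), len(seq[0][0])
    return [[max(f[r][c] for f in seq) for c in range(cols)] for r in range(rows)]


def horizontal_pooling(fmap: Frame, parts: int) -> List[float]:
    """Split rows into `parts` equal strips; pool each strip as max + mean (assumed)."""
    if len(fmap) % parts != 0:
        raise ValueError("number of rows must be divisible by parts")
    strip_h = len(fmap) // parts
    pooled = []
    for p in range(parts):
        # Flatten all values that fall inside the p-th horizontal strip.
        values = [v for row in fmap[p * strip_h:(p + 1) * strip_h] for v in row]
        pooled.append(max(values) + sum(values) / len(values))
    return pooled


# Two frames of a 2x2 feature map.
seq = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 1.0], [0.0, 8.0]],
]
set_level = temporal_pooling(seq)               # one map summarising the sequence
strip_features = horizontal_pooling(set_level, parts=2)  # one value per strip
```

Each strip value would correspond to one body-part descriptor; with real multi-channel features, the same pooling would be applied per channel before the fully-connected layers mentioned in the caption.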
