Mesh-Gait: A Unified Framework for Gait Recognition Through Multi-Modal Representation Learning from 2D Silhouettes

Zhao-Yang Wang, Jieneng Chen, Jiang Liu, Yuxiang Guo, Rama Chellappa

TL;DR

Mesh-Gait addresses the fragility of 2D silhouette-based gait recognition under occlusion and viewpoint changes by introducing an end-to-end framework that reconstructs intermediate 3D heatmaps from 2D silhouettes. The model combines a 2D silhouette branch with a 3D heatmap branch, where the heatmaps are progressively refined through supervision on reconstructed joints, markers, and meshes, and then fused with silhouette features for recognition. The approach achieves state-of-the-art results on Gait3D and OUMVLP-Mesh, demonstrates strong robustness across backbones and viewing conditions, and offers real-time potential by removing mesh reconstruction during inference. This work advances practical gait recognition by integrating 3D structural information efficiently via heatmaps, enabling more reliable identity verification in unconstrained environments.

Abstract

Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. In existing methods, directly learning 3D features from 3D joints or meshes is complex, and such features are difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
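
The abstract states that the 3D heatmaps are supervised through losses on the 3D joints and virtual markers reconstructed from them, but it does not say how coordinates are read out of a heatmap. The sketch below illustrates one common choice, a soft-argmax (the softmax-weighted expected voxel index), followed by an L1 comparison with a ground-truth coordinate. Both the soft-argmax readout and the helper names `soft_argmax_3d` and `l1_loss` are assumptions made for illustration, not the authors' implementation.

```python
import math

def soft_argmax_3d(heatmap):
    """Expected (x, y, z) voxel index of a 3D heatmap stored as nested lists [z][y][x].

    Assumption: a differentiable soft-argmax readout; the paper summary does not
    specify how coordinates are decoded from the heatmaps.
    """
    # Shift by the maximum value so the exponentials stay numerically stable.
    peak = max(v for plane in heatmap for row in plane for v in row)
    total = ex = ey = ez = 0.0
    for z, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, v in enumerate(row):
                w = math.exp(v - peak)  # unnormalized softmax weight of this voxel
                total += w
                ex += w * x
                ey += w * y
                ez += w * z
    return (ex / total, ey / total, ez / total)

def l1_loss(pred, target):
    """Mean absolute error between two coordinate tuples of equal length."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Toy 3x3x3 heatmap with one sharp peak at x=2, y=1, z=0 (hypothetical data).
heatmap = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
heatmap[0][1][2] = 50.0

joint = soft_argmax_3d(heatmap)
loss = l1_loss(joint, (2.0, 1.0, 0.0))
print(joint, loss)
```

Because the readout is a weighted average rather than a hard argmax, a gradient of the coordinate loss can flow back into every voxel of the heatmap, which is one way heatmaps could become "increasingly accurate" under coordinate supervision.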

Paper Structure

This paper contains 22 sections, 18 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example comparison highlighting the differences between the Mesh-Gait framework and traditional multi-modal methods in terms of training, inference, and efficiency. Mesh-Gait is trained from scratch using supervised learning; training requires both mask segmentation and mesh reconstruction, whereas inference requires only mask segmentation. Because no mesh reconstruction is performed at inference time, Mesh-Gait has a lower computational cost.
  • Figure 2: The main architecture of Mesh-Gait consists of two parallel branches that process silhouette sequences extracted from RGB videos using an image segmentation model. The 2D feature branch employs a convolutional backbone to extract gait features from 2D silhouettes. In parallel, the 3D feature branch reconstructs 3D heatmaps as an intermediate representation from the silhouette sequences using a 3D estimator, which is trained from scratch. To progressively refine the 3D heatmaps during training, they are used for reconstructing 3D joints, virtual markers, and meshes in a supervised manner. In addition, the reconstructed 3D heatmaps are also used for 3D feature extraction. Features from both branches are then fused and mapped for gait recognition. The model is trained in a supervised manner using a combination of triplet loss, cross-entropy loss, L1 loss, and L2 loss.
  • Figure 3: Visualization of reconstructed 3D representations during testing. Each row shows the reconstruction results for a different subject. The predicted and ground-truth virtual markers are shown on the surface of the ground-truth meshes. The meshes reconstructed from the predicted virtual markers are shown in the fourth column, and the ground-truth meshes are shown in the fifth column.
  • Figure : 3D Representation Reconstruction from Silhouettes
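
The Figure 2 caption says the model is trained with a combination of triplet loss, cross-entropy loss, L1 loss, and L2 loss. A minimal sketch of such a weighted sum is given below in plain Python. The margin, the weights, the function names, and the choice of which reconstruction target would use L1 versus L2 are all hypothetical; the summary does not specify them.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(terms, weights):
    """Weighted sum of named loss terms; the weights here are illustrative assumptions."""
    return sum(weights[name] * value for name, value in terms.items())

# Toy values standing in for fused gait embeddings, identity logits,
# and reconstructed 3D coordinates (all hypothetical).
terms = {
    "triplet": triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]),
    "ce": cross_entropy([0.0, 0.0], 0),
    "l1": l1_loss([1.0, 2.0], [2.0, 4.0]),
    "l2": l2_loss([1.0, 2.0], [2.0, 4.0]),
}
weights = {"triplet": 1.0, "ce": 1.0, "l1": 1.0, "l2": 1.0}
print(combined_loss(terms, weights))
```

In this sketch the triplet and cross-entropy terms stand in for the recognition objective on the fused features, while the L1 and L2 terms stand in for the reconstruction supervision; how the paper balances them is not stated in the summary.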