GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition

Fanxu Min, Shaoxiang Guo, Fan Hao, Junyu Dong

TL;DR

GaitMA tackles gait recognition by fusing silhouette-based appearance with skeleton-based structure through joint/limb heatmaps and a dual-branch CNN. A co-attention alignment module and a mutual learning module enable effective cross-modal interaction, guided by a Wasserstein loss to harmonize feature distributions. The approach achieves state-of-the-art results on multiple datasets (Gait3D, OU-MVLP, CASIA-B) and is shown to be robust against occlusions and background clutter. This multi-modal fusion framework offers a scalable path to more reliable gait-based identification in real-world scenarios.

Abstract

Gait recognition is a biometric technology that identifies people by their walking patterns. Existing appearance-based methods use CNNs or Transformers to extract spatial and temporal features from silhouettes, while model-based methods employ GCNs to exploit the topological structure of skeleton points. However, the quality of silhouettes is limited by complex occlusions, and skeletons lack dense semantic features of the human body. To tackle these problems, we propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA), which combines the two modalities to obtain a more robust and comprehensive gait representation for recognition. First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are extracted by two separate CNN-based feature extractors. Second, a co-attention alignment module is proposed to align the features through element-wise attention. Finally, we propose a mutual learning module that fuses the features through cross-attention; a Wasserstein loss is further introduced to ensure the effective fusion of the two modalities. Extensive experimental results demonstrate the superiority of our model on Gait3D, OU-MVLP, and CASIA-B.
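
The first step above, representing skeletons as joint heatmaps so that a CNN can consume them, can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian form, the `sigma` value, the 64x44 map size, and the function name `joint_heatmaps` are all assumptions made for illustration, since the abstract does not specify how the heatmaps are built.

```python
import numpy as np


def joint_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one 2-D Gaussian heatmap per joint (illustrative assumption).

    keypoints: sequence of (x, y, confidence) triples, one per joint,
               with x and y given in pixel coordinates.
    Returns an array of shape (num_joints, height, width).
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    # Pixel coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((keypoints.shape[0], height, width))
    for j, (x, y, conf) in enumerate(keypoints):
        # Squared distance of every pixel to the joint location.
        d2 = (xs - x) ** 2 + (ys - y) ** 2
        # Peak value equals the keypoint confidence (an assumed convention).
        maps[j] = conf * np.exp(-d2 / (2.0 * sigma ** 2))
    return maps


# Two hypothetical joints on a 64x44 map (size chosen arbitrarily).
kps = [(10.0, 20.0, 1.0), (30.0, 5.0, 0.5)]
hm = joint_heatmaps(kps, height=64, width=44)
print(hm.shape)
```

Stacking such maps along the channel axis gives the skeleton branch an image-like input with the same spatial layout as the silhouette, which is what makes element-wise alignment between the two feature streams plausible.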

Paper Structure (13 sections, 9 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: A brief visualization of our motivation. Skeletons can effectively complement gait features missing from silhouettes across various challenging scenarios.
  • Figure 2: An overview of the proposed GaitMA framework for gait recognition. T&H denotes horizontal mapping and temporal aggregation. Concat and Seq denote feature concatenation and separation, respectively.