HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network

Hu Cui, Tessai Hayama

TL;DR

The paper proposes a Hyper-GCN and Shuffle Mamba (HGMamba) block that processes input data through two parallel streams, Hyper-GCN and Shuffle Mamba, combining strong global feature modeling with accurate local structure modeling.

Abstract

3D human pose lifting is a promising research area that leverages estimated and ground-truth 2D human pose data for training. While existing approaches primarily aim to enhance the performance of estimated 2D poses, they often struggle when applied to ground-truth 2D pose data. We observe that achieving accurate 3D pose reconstruction from ground-truth 2D poses requires precise modeling of local pose structures, alongside the ability to extract robust global spatio-temporal features. To address these challenges, we propose a novel Hyper-GCN and Shuffle Mamba (HGMamba) block, which processes input data through two parallel streams: Hyper-GCN and Shuffle-Mamba. The Hyper-GCN stream models the human body structure as hypergraphs with varying levels of granularity to effectively capture local joint dependencies. Meanwhile, the Shuffle Mamba stream leverages a state space model to perform spatio-temporal scanning across all joints, enabling the establishment of global dependencies. By adaptively fusing these two representations, HGMamba achieves strong global feature modeling while excelling at local structure modeling. We stack multiple HGMamba blocks to create three variants of our model, allowing users to select the most suitable configuration based on the desired speed-accuracy trade-off. Extensive evaluations on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate the effectiveness of our approach. HGMamba-B achieves state-of-the-art results, with P1 errors of 38.65 mm and 14.33 mm on the respective datasets. Code and models are available: https://github.com/HuCui2022/HGMamba
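
The errors quoted in the abstract are reported in millimetres. As a rough illustration, and assuming "P1" refers to the commonly used Protocol 1 metric (mean per-joint position error, MPJPE), the computation can be sketched as follows. The function name and the toy pose values are hypothetical and are not taken from the paper or its repository.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error between two 3D poses.

    pred, gt: lists of (x, y, z) joint coordinates in millimetres,
    given in the same joint order.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance per joint, averaged over all joints.
    total = sum(math.dist(p, g) for p, g in zip(pred, gt))
    return total / len(pred)

# Toy 3-joint pose (hypothetical values, not results from the paper).
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (103.0, 4.0, 0.0), (100.0, 200.0, 12.0)]
error_mm = mpjpe(pred, gt)
```

In benchmark evaluation this per-pose value is typically averaged over all frames of the test set; the sketch covers a single pose only.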

Paper Structure

This paper contains 13 sections, 21 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of HGMamba. It is composed of $N$ dual-stream HGM blocks. The Hyper-GCN stream captures local human body part semantics, while the state-space module (Mamba-block) effectively models robust global information.
  • Figure 2: Visualization of 3D pose results from noisy 2D poses.
  • Figure 3: Visualization of 3D results for GT-2D and estimated-2D poses. * denotes results for GT-2D poses. $\dagger$ denotes results for 2D poses estimated by Stacked Hourglass.
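
The dual-stream idea described in the abstract and in the Figure 1 caption can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration: the hyperedge averaging stands in for the Hyper-GCN stream, a fixed linear recurrence stands in for the Mamba state space scan (it is not a selective SSM and does no shuffling), and the softmax gate is one plausible form of adaptive fusion. All function names, weights, and inputs are hypothetical and do not come from the authors' code.

```python
import math

def hypergraph_stream(x, hyperedges):
    """Local stream stand-in: average joint features inside each hyperedge,
    then give each joint the mean of the hyperedges it belongs to."""
    dim = len(x[0])
    edge_feats = [
        [sum(x[j][d] for j in edge) / len(edge) for d in range(dim)]
        for edge in hyperedges
    ]
    out = []
    for j in range(len(x)):
        member_of = [e for e, edge in enumerate(hyperedges) if j in edge]
        if not member_of:
            out.append(list(x[j]))  # joint in no hyperedge keeps its feature
            continue
        out.append([
            sum(edge_feats[e][d] for e in member_of) / len(member_of)
            for d in range(dim)
        ])
    return out

def scan_stream(x, decay=0.5):
    """Global stream stand-in: a linear recurrence scanned over the joint
    sequence, so each output depends on all earlier joints."""
    dim = len(x[0])
    h = [0.0] * dim
    out = []
    for feat in x:
        h = [decay * h[d] + (1.0 - decay) * feat[d] for d in range(dim)]
        out.append(h)
    return out

def adaptive_fuse(local, global_, w_local, w_global):
    """Per-joint softmax gate over the two streams (assumed fusion form)."""
    fused = []
    for l, g in zip(local, global_):
        s_l = sum(w * v for w, v in zip(w_local, l))
        s_g = sum(w * v for w, v in zip(w_global, g))
        m = max(s_l, s_g)  # subtract the max for numerical stability
        e_l, e_g = math.exp(s_l - m), math.exp(s_g - m)
        a_l = e_l / (e_l + e_g)
        fused.append([a_l * lv + (1.0 - a_l) * gv for lv, gv in zip(l, g)])
    return fused

# Four joints with 2-D features and two hyperedges (hypothetical grouping).
x = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
hyperedges = [[0, 1], [2, 3]]
local = hypergraph_stream(x, hyperedges)
global_ = scan_stream(x)
# Zero scoring weights give an even 50/50 gate between the two streams.
fused = adaptive_fuse(local, global_, [0.0, 0.0], [0.0, 0.0])
```

In a trained model the gate weights would be learned, so the balance between local and global features could differ per joint; the zero weights here only make the toy output easy to check by hand.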