Pose Magic: Efficient and Temporally Consistent Human Pose Estimation with a Hybrid Mamba-GCN Network

Xinyi Zhang, Qiqi Bao, Qinpeng Cui, Wenming Yang, Qingmin Liao

TL;DR

This work proposes a new attention-free hybrid spatiotemporal architecture named Hybrid Mamba-GCN (Pose Magic), which introduces local enhancement with GCN by capturing relationships between neighboring joints, thus producing new representations to complement Mamba's outputs.

Abstract

Current state-of-the-art (SOTA) methods in 3D Human Pose Estimation (HPE) are primarily based on Transformers. However, existing Transformer-based 3D HPE backbones often encounter a trade-off between accuracy and computational efficiency. To resolve the above dilemma, in this work, we leverage recent advances in state space models and utilize Mamba for high-quality and efficient long-range modeling. Nonetheless, Mamba still faces challenges in precisely exploiting local dependencies between joints. To address these issues, we propose a new attention-free hybrid spatiotemporal architecture named Hybrid Mamba-GCN (Pose Magic). This architecture introduces local enhancement with GCN by capturing relationships between neighboring joints, thus producing new representations to complement Mamba's outputs. By adaptively fusing representations from Mamba and GCN, Pose Magic demonstrates superior capability in learning the underlying 3D structure. To meet the requirements of real-time inference, we also provide a fully causal version. Extensive experiments show that Pose Magic achieves new SOTA results ($\downarrow 0.9$ mm) while saving $74.1\%$ FLOPs. In addition, Pose Magic exhibits optimal motion consistency and the ability to generalize to unseen sequence lengths.
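
The abstract does not spell out how the Mamba and GCN representations are adaptively fused. As a rough illustration only, one common way to fuse two streams is to predict two scores from the concatenated features, normalize them with a softmax, and take the weighted sum. The sketch below assumes that form; the function name `adaptive_fuse` and the weight vectors `w_mamba` and `w_gcn` are hypothetical and are not taken from the paper.

```python
import math


def adaptive_fuse(mamba_feat, gcn_feat, w_mamba, w_gcn):
    """Fuse two equal-length feature vectors with softmax-normalized weights.

    w_mamba and w_gcn are hypothetical learned vectors applied to the
    concatenated features; they stand in for whatever the paper actually uses.
    """
    concat = mamba_feat + gcn_feat  # list concatenation, length 2 * C
    score_m = sum(w * x for w, x in zip(w_mamba, concat))
    score_g = sum(w * x for w, x in zip(w_gcn, concat))
    # Softmax over the two scores, shifted by the max for numerical stability.
    top = max(score_m, score_g)
    exp_m = math.exp(score_m - top)
    exp_g = math.exp(score_g - top)
    alpha_m = exp_m / (exp_m + exp_g)
    alpha_g = 1.0 - alpha_m
    # Weighted sum of the two streams, element by element.
    fused = [alpha_m * m + alpha_g * g for m, g in zip(mamba_feat, gcn_feat)]
    return fused, (alpha_m, alpha_g)


# With all-zero weights both streams score equally, so the result is their mean.
fused, alphas = adaptive_fuse([1.0, 2.0], [3.0, 4.0], [0.0] * 4, [0.0] * 4)
```

Because the weights depend on the input features, the balance between the global (Mamba) and local (GCN) streams can differ per sample, which is the general idea behind an adaptive fusion of this kind.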

Paper Structure (19 sections, 15 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparisons of transformer-based methods on Human3.6M ($\downarrow$). FLOPs/frame denotes floating point operations per output frame. The proposed Pose Magic attains superior results, while maintaining computational efficiency.
  • Figure 2: Overview of Pose Magic. It consists of $N$ dual-stream Magic Blocks, with GCN capturing local information and Mamba capturing global information. Spatial GCN/Mamba models connections among joints within a frame, while the Temporal one tracks each joint's motion over time.
  • Figure 3: Different Mamba structures. (a) Bidirectional: processes information in the forward and backward directions independently. (b) Unidirectional: processes information in the forward direction, independently. Here, current information relates only to present and past data, making it suitable for real-time applications.
  • Figure 4: (a) GCN structure. (b) Spatial GCN uses the Human3.6M skeleton as the adjacency matrix. (c) Temporal GCN uses K-NN for connection edges based on joint similarity across frames. After K-NN, each row connects to $K$ columns. Top: bidirectional adjacency matrix. Bottom: unidirectional adjacency matrix: K-NN after a causal mask.
  • Figure 5: Comparison results of ACC-ERR on Human3.6M.
  • ...and 1 more figures
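
The Figure 4(c) caption describes a temporal adjacency matrix built with K-NN over frames, with a causal mask applied first in the unidirectional case. A minimal sketch of that construction for a single joint follows. The dot-product similarity, the tie-breaking rule, and the name `knn_temporal_adjacency` are assumptions made for illustration, not details confirmed by the source.

```python
def knn_temporal_adjacency(features, k, causal=False):
    """Build a T x T 0/1 adjacency matrix over frames for one joint.

    features: list of T feature vectors (lists of floats), one per frame.
    k: number of columns each row connects to.
    causal: if True, frame t may only connect to frames s <= t.
    """
    num_frames = len(features)
    adjacency = [[0] * num_frames for _ in range(num_frames)]
    for t in range(num_frames):
        # Causal mask: restrict candidates to the present and past frames.
        candidates = range(t + 1) if causal else range(num_frames)
        # Similarity is assumed to be a plain dot product (illustrative choice).
        scored = [
            (sum(a * b for a, b in zip(features[t], features[s])), s)
            for s in candidates
        ]
        # Highest similarity first; ties go to the smaller frame index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, s in scored[:k]:
            adjacency[t][s] = 1
    return adjacency


frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bidirectional = knn_temporal_adjacency(frames, k=2)
unidirectional = knn_temporal_adjacency(frames, k=2, causal=True)
```

In this sketch every row of the bidirectional matrix has exactly `k` ones, while the unidirectional matrix is lower triangular, so its first rows have fewer than `k` connections because fewer past frames are available.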