PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model

Yunlong Huang, Junshuo Liu, Ke Xian, Robert Caiming Qiu

TL;DR

PoseMamba advances monocular 3D human pose estimation by fully embracing state-space models to achieve linear-time spatio-temporal modeling, addressing the quadratic bottleneck of transformer attention. It introduces a bidirectional global-local spatio-temporal Mamba block and a reordering strategy that enhances local limb modeling while preserving global skeleton context, enabling efficient learning of spatial-temporal correlations. Evaluations on Human3.6M and MPI-INF-3DHP show state-of-the-art accuracy with significantly fewer parameters and MACs, demonstrating both effectiveness and efficiency. The results highlight the viability of SSM-based architectures for 3D HPE and point to PoseMamba as a promising, lightweight backbone for future 3D vision systems.

Abstract

Transformers have significantly advanced the field of 3D human pose estimation (HPE). However, existing transformer-based methods primarily rely on self-attention for spatio-temporal modeling, which leads to quadratic complexity, unidirectional modeling of spatio-temporal relationships, and insufficient learning of spatio-temporal correlations. Recently, the Mamba architecture, built on the state space model (SSM), has exhibited superior long-range modeling capabilities in a variety of vision tasks with linear complexity. In this paper, we propose PoseMamba, a novel purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video. Specifically, we propose a bidirectional global-local spatio-temporal SSM block that comprehensively models human joint relations within individual frames as well as temporal correlations across frames. Within this block, we introduce a reordering strategy to enhance the local modeling capability of the SSM. This strategy provides a more logical geometric scanning order and integrates it with the global SSM, resulting in a combined global-local spatial scan. We evaluate our approach quantitatively and qualitatively on two benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments demonstrate that PoseMamba achieves state-of-the-art performance on both datasets while maintaining a smaller model size and reducing computational costs. The code and models will be released.
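To make the scanning idea in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a scalar, non-selective linear SSM (`h_t = a*h_{t-1} + b*x_t`, `y_t = c*h_t`) in place of a full Mamba block, it assumes that forward and backward scans are combined by summation, and it assumes that the global and reordered (local) scans are likewise summed. The function names and the `local_order` permutation are hypothetical; the paper's actual joint ordering is not given in this summary.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: float = 0.9,
             b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear SSM evaluated in one pass, so cost is O(len(x))."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t      # state update
        y.append(c * h)          # readout
    return y


def bidirectional_scan(x: Sequence[float], a: float = 0.9) -> List[float]:
    """Scan left-to-right and right-to-left, then sum (an assumed fusion rule)."""
    fwd = ssm_scan(x, a)
    bwd = ssm_scan(list(x)[::-1], a)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


def reordered_scan(x: Sequence[float], order: Sequence[int],
                   a: float = 0.9) -> List[float]:
    """Scan tokens in a permuted order, then restore the original positions."""
    permuted = [x[i] for i in order]
    y_perm = bidirectional_scan(permuted, a)
    y = [0.0] * len(x)
    for pos, i in enumerate(order):
        y[i] = y_perm[pos]
    return y


def global_local_scan(x: Sequence[float], local_order: Sequence[int],
                      a: float = 0.9) -> List[float]:
    """Combine a scan in the default order with a scan in a reordered one."""
    global_out = bidirectional_scan(x, a)
    local_out = reordered_scan(x, local_order, a)
    return [g + l for g, l in zip(global_out, local_out)]


if __name__ == "__main__":
    joints = [1.0, 0.0, 0.0]      # toy per-joint feature, one channel
    local_order = [2, 0, 1]       # hypothetical limb-aware ordering
    print(global_local_scan(joints, local_order, a=0.5))
```

Each scan touches every token once, which is where the linear cost comes from; the reordering only changes which tokens are neighbours in the sequence, and therefore which ones influence each other most strongly through the decaying state.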

Paper Structure

This paper contains 28 sections, 12 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Comparison of recent 3D human pose estimation techniques on Human3.6M [ionescu2013human3] (lower is better). MACs/frame denotes multiply-accumulate operations per output frame. PoseMamba comes in several variants and achieves superior results while remaining computationally efficient.
  • Figure 2: The pipeline of PoseMamba. We first use a fully connected layer to project the input keypoint sequence, then inject the position and temporal embedding matrices into the sequence. After that, we feed the sequence into the Mamba blocks.
  • Figure 3: Illustration of various spatio-temporal modeling mechanisms. (a) Self-attention [vaswani2017attention, ViT]. (b) Bidirectional spatio-temporal scan [liu2024vmamba]. (c) Our proposed bidirectional global-local spatio-temporal scan mechanism, which leverages the geometry of the human skeleton to enhance detail.
  • Figure 4: Illustration of different unidirectional spatio-temporal scan mechanisms.
  • Figure 5: Visualization of the SSM map across body joints and frames.
  • ...and 2 more figures
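
The input stage described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the embeddings are combined with the projected keypoints by element-wise addition (the caption only says they are embedded into the sequence), and the function name, argument names, and shapes are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Tokens = List[List[List[float]]]  # indexed as [frame][joint][channel]


def embed_keypoints(keypoints: Tokens, weight: Matrix, bias: List[float],
                    pos_embed: Matrix, temp_embed: Matrix) -> Tokens:
    """Project each 2D keypoint to D channels and add both embeddings.

    keypoints:  T frames x J joints x C_in coordinates
    weight:     D x C_in matrix of the fully connected layer
    bias:       D values
    pos_embed:  J x D, one row per joint (assumed additive)
    temp_embed: T x D, one row per frame (assumed additive)
    """
    out: Tokens = []
    for t, frame in enumerate(keypoints):
        frame_tokens = []
        for j, joint in enumerate(frame):
            token = [
                sum(w * v for w, v in zip(weight[d], joint)) + bias[d]
                + pos_embed[j][d] + temp_embed[t][d]
                for d in range(len(weight))
            ]
            frame_tokens.append(token)
        out.append(frame_tokens)
    return out


if __name__ == "__main__":
    keypoints = [[[1.0, 2.0]]]                         # T=1, J=1, C_in=2
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # D=3
    bias = [0.0, 0.0, 0.0]
    pos_embed = [[0.5, 0.5, 0.5]]
    temp_embed = [[0.25, 0.25, 0.25]]
    print(embed_keypoints(keypoints, weight, bias, pos_embed, temp_embed))
```

The resulting T x J x D token grid is what a stack of spatio-temporal blocks would then consume, scanning along the joint axis within each frame and along the frame axis for each joint.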