SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation

Hu Cui, Wenqiang Hua, Renjing Huang, Shurui Jia, Tessai Hayama

TL;DR

SasMamba tackles monocular 3D human pose estimation while preserving skeletal topology, using a Skeleton Structure-Aware Stride SSM (SAS-SSM). SAS-SSM combines a structure-aware spatiotemporal convolution with a stride-based scan to build multi-scale global representations at linear computational complexity, and serves as the core block of the lightweight SasMamba model. The approach achieves competitive or state-of-the-art results on Human3.6M and MPI-INF-3DHP with far fewer parameters and less computation than Transformer-based or hybrid architectures. This makes the structure-aware, multi-scale SSM framework a practical fit for real-time or resource-constrained 3D pose estimation, since it retains both spatial integrity and long-range dependencies.

Abstract

Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations to flatten detected 2D pose sequences into purely temporal sequences, either locally or globally. This approach disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies. To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity. Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models. The source code is available at https://hucui2022.github.io/sasmamba_proj/.
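The stride-based scan is only described at a high level here, so the following is a minimal sketch of one way the joint indices for a single stride could be generated, not the authors' implementation. It assumes the padding rule stated in the Figure 2 caption (an out-of-range position reuses the most recent valid joint so the sequence length stays fixed). The function name `stride_indices` and the `offset` parameter are hypothetical, and the real method may order joints by skeleton topology rather than by raw index.

```python
def stride_indices(num_joints, stride, offset=0):
    """Return num_joints joint indices sampled with a fixed stride.

    Hypothetical helper: positions that would fall past the last joint
    repeat the most recent valid index, so the output length always
    equals num_joints (assumed reading of the Figure 2 caption).
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    if not 0 <= offset < num_joints:
        raise ValueError("offset must point at an existing joint")

    indices = []
    last_valid = offset
    for k in range(num_joints):
        candidate = offset + k * stride
        if candidate < num_joints:
            last_valid = candidate  # still inside the skeleton
        indices.append(last_valid)  # otherwise reuse the last valid joint
    return indices


# Example with a 17-joint skeleton (the Human3.6M convention):
# several strides give coarser and coarser views of the same joints.
multi_scale = {s: stride_indices(17, s) for s in (1, 2, 4)}
```

A stride of 1 reproduces the plain joint order, while larger strides visit fewer distinct joints and pad the tail, which is one simple way to obtain sequences of equal length at several scales.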

Paper Structure

This paper contains 23 sections, 17 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Comparison of Pose Sequence Processing: Flattened Scan vs. Structure-Aware Scan.
  • Figure 2: The overall framework of SasMamba. The input 2D keypoint sequence is projected into a high-dimensional space, enhanced with positional and temporal embeddings, and processed by SasMamba blocks. To ensure consistent sequence length during Spatial Stride Sampling, invalid tokens (in gray) are replaced with the most recent valid joint token.
  • Figure 3: Qualitative comparisons of our proposed SasMamba with PoseMamba-S [huang2025posemamba], MotionAGFormer-S [mehraban2024motionagformer], and HGMamba-S [cui2025hgmamba] on 3D human pose estimation. The solid purple skeletons represent the ground-truth 3D poses, while the dashed green skeletons indicate the predicted 3D poses.
  • Figure 4: Qualitative comparisons with PoseMamba-S [huang2025posemamba], MotionAGFormer-S [mehraban2024motionagformer], and HGMamba-S [cui2025hgmamba] on challenging in-the-wild videos. Red arrows indicate accurate estimations, while gray arrows highlight unsatisfactory estimations.
  • Figure 5: Qualitative Results on Mildly Challenging Wild Videos. Red arrows highlight erroneous 2D pose estimations, while green arrows indicate correct 3D predictions.
  • ...and 3 more figures