High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse, Boeun Kim, Yi Chang, Yixing Gao

TL;DR

This work introduces GLSMamba, a pure Mamba-based framework for video-based human pose estimation (VHPE) that decouples global and local high-resolution spatiotemporal representations. It comprises two modules. The Global Spatiotemporal Mamba (GSM) captures holistic dynamics using a 6D selective space-time scan (STS6D) followed by spatial- and temporal-modulated scan merging (STMM). The Local Refinement Mamba (LRM) recovers fine-grained local motion details using a windowed space-time scan (WSTS). Experiments on PoseTrack2017, PoseTrack2018, PoseTrack21, and Sub-JHMDB show state-of-the-art performance with favorable computational trade-offs, and ablations support the contribution of each component (GSM, LRM, STS6D, STMM, and WSTS). Because Mamba-based spatiotemporal reasoning scales linearly with sequence length, the global-local design may also extend to 3D pose estimation and video segmentation.

Abstract

Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently has difficulty balancing global and local dynamic modeling and may bias the network toward one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, which limits their applicability, especially for high-resolution sequences. Recently, state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba in two ways to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs a 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a Local Refinement Mamba, based on a windowed space-time scan, to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.
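
Since Mamba operates on 1D sequences, a video feature volume has to be flattened into one or more token orderings before it can be scanned. The sketch below illustrates that idea only; it is not the authors' implementation. The function names, the choice of six orderings (three hypothetical axis orders, each read forward and backward), and the window layout are all assumptions made for illustration, as the abstract does not specify them.

```python
import numpy as np

def scan_orders(T, H, W):
    """Return six 1D visiting orders over a T x H x W token grid.

    Assumption (not from the paper): three axis orderings, each traversed
    forward and backward, giving six sequences in total.
    """
    idx = np.arange(T * H * W).reshape(T, H, W)
    axis_orders = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]  # hypothetical choice
    orders = []
    for axes in axis_orders:
        flat = idx.transpose(axes).reshape(-1)
        orders.append(flat)        # forward traversal
        orders.append(flat[::-1])  # backward traversal
    return orders

def windowed_scan_order(T, H, W, wh, ww):
    """Visit tokens one spatial window at a time.

    Within each wh x ww window, all T frames are scanned before moving to
    the next window, so spatially and temporally nearby tokens stay close
    together in the 1D sequence. The window layout is an assumption.
    """
    assert H % wh == 0 and W % ww == 0, "window must tile the grid"
    idx = np.arange(T * H * W).reshape(T, H // wh, wh, W // ww, ww)
    # Reorder to (window row, window col, time, in-window row, in-window col).
    return idx.transpose(1, 3, 0, 2, 4).reshape(-1)

# Usage: reorder flattened tokens, process them, then restore the layout.
tokens = np.arange(2 * 4 * 4, dtype=np.float32)  # stand-in for features
order = windowed_scan_order(2, 4, 4, 2, 2)
scanned = tokens[order]                 # sequence a 1D model would consume
restored = scanned[np.argsort(order)]   # inverse permutation
```

Each ordering is a permutation of the same tokens, so the cost of building and undoing it is linear in the number of tokens; the outputs of several orderings can be mapped back to the original layout with `np.argsort` and then merged.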

Paper Structure

This paper contains 16 sections, 9 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: State-of-the-art methods such as (a) TDMI [feng2023mutual] and (b) DiffPose [feng2023diffpose] focus on either global or local spatiotemporal contexts, which may fail in occlusion or blur cases. Our method (c) fully exploits both global and local high-resolution spatiotemporal representations, delivering more robust results.
  • Figure 2: Overall pipeline of the proposed framework. Given an input sequence, we first extract high-resolution spatial features for each frame using a visual encoder. Then, these features are processed successively by GSM and LRM for global spatiotemporal modeling and local detail enhancement. Finally, a detection head is employed to yield the pose heatmap estimations.
  • Figure 3: Visualizations of activation maps of STS6D.
  • Figure 4: Visual results of our method on benchmarks. Challenging scenes such as occlusion and motion blur are involved.
  • Figure 5: Qualitative comparisons of pose predictions of (a) GLSMamba-B, (b) TDMI, and (c) DiffPose on the PoseTrack dataset. Inaccurate results are highlighted by red circles.
  • ...and 4 more figures