High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
Runyang Feng, Hyung Jin Chang, Tze Ho Elden Tse, Boeun Kim, Yi Chang, Yixing Gao
TL;DR
This work introduces GLSMamba, a pure Mamba-based framework for video-based human pose estimation (VHPE) that learns global and local high-resolution spatiotemporal representations separately. A Global Spatiotemporal Mamba (GSM) uses a 6D selective space-time scan (STS6D) and adaptive spatial- and temporal-modulated scan merging (STMM) to capture holistic dynamics, while a Local Refinement Mamba (LRM) uses a windowed space-time scan (WSTS) to recover fine-grained local motion details, enabling efficient high-resolution modeling. Extensive experiments on PoseTrack2017, PoseTrack2018, PoseTrack21, and Sub-JHMDB demonstrate state-of-the-art performance with favorable computational trade-offs, and ablations confirm the effectiveness of each component (GSM, LRM, STS6D, STMM, and WSTS). The approach advances VHPE through linear-complexity Mamba-based spatiotemporal reasoning, and it may extend to 3D pose estimation and video segmentation, which could also benefit from global-local high-resolution representations.
Abstract
Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently struggles to balance global and local dynamic modeling and may bias the network toward one of them, leading to suboptimal performance. Moreover, existing VHPE models incur quadratic complexity when capturing global dependencies, which limits their applicability, especially to high-resolution sequences. Recently, state space models, notably Mamba, have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba in two ways to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs a 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a Local Refinement Mamba, based on a windowed space-time scan, to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.
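To make the notion of a "scan" concrete, the following is a minimal sketch of how a video feature volume can be unrolled into the 1D token sequences that a Mamba-style block consumes, either globally or within local windows. It is an illustration only: the function names `flatten_scan` and `windowed_scan`, the tensor layout (T, H, W, C), the traversal orders, and the window size are all assumptions made here, and the abstract does not specify the actual six scan directions or the windowing scheme used by the model.

```python
import numpy as np


def flatten_scan(x, axis_order):
    """Unroll a (T, H, W, C) volume into an (L, C) sequence, L = T*H*W.

    axis_order is a permutation of (0, 1, 2); the last axis listed
    varies fastest along the resulting sequence. (Hypothetical helper.)
    """
    t, h, w, c = x.shape
    return x.transpose(*axis_order, 3).reshape(t * h * w, c)


def windowed_scan(x, window):
    """Split (T, H, W, C) into non-overlapping (wt, wh, ww) windows and
    unroll each window into its own short sequence.

    Returns an array of shape (num_windows, wt*wh*ww, C). (Hypothetical helper.)
    """
    t, h, w, c = x.shape
    wt, wh, ww = window
    assert t % wt == 0 and h % wh == 0 and w % ww == 0
    x = x.reshape(t // wt, wt, h // wh, wh, w // ww, ww, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # window indices first, then in-window offsets
    return x.reshape(-1, wt * wh * ww, c)


# Toy volume: 2 frames of 4x4 features with a single channel.
x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4, 1)

spatial_first = flatten_scan(x, (0, 1, 2))   # frame by frame, row-major within a frame
temporal_first = flatten_scan(x, (1, 2, 0))  # each spatial location traversed across time first
backward = spatial_first[::-1]               # the same path in reverse
local = windowed_scan(x, (2, 2, 2))          # four 2x2x2 windows of 8 tokens each
```

Each global ordering yields one sequence over the whole volume, so a linear-time sequence model sees every token, whereas the windowed variant yields many short sequences in which only spatially and temporally adjacent tokens interact.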
