
Optimizing Local-Global Dependencies for Accurate 3D Human Pose Estimation

Guangsheng Xu, Guoyi Zhang, Lejia Ye, Shuwei Gan, Xiaohu Zhang, Xia Yang

TL;DR

SSR-STF is a dual-stream framework for monocular 3D human pose estimation that fuses local skeletal details (captured by SSRFormer with Skeleton Selective Refine Attention) with global spatio-temporal dependencies (captured by STFormer). The method jointly learns fine-grained local features and long-range context, using an adaptive fusion strategy and a large-kernel decomposition to capture skeletal dynamics efficiently. On Human3.6M and MPI-INF-3DHP, SSR-STF reports state-of-the-art MPJPE (P1) and PCK/AUC results, and its learned motion representations also transfer to downstream tasks such as SMPL-based mesh recovery. These results indicate that the model generalizes across datasets and is useful for both pose estimation and motion analysis.
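To make the idea of adaptively fusing two feature streams concrete, here is a minimal NumPy sketch. It assumes one common formulation: per-token softmax weights predicted from the concatenated streams. The summary does not give the authors' exact fusion rule, so the function name `adaptive_fusion`, the parameters `weight` and `bias`, and the tensor shapes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(local_feat, global_feat, weight, bias):
    """Blend two streams with per-token weights (illustrative assumption).

    local_feat, global_feat: (T, J, C) features for T frames and J joints.
    weight: (2C, 2) and bias: (2,) are hypothetical learned parameters.
    """
    both = np.concatenate([local_feat, global_feat], axis=-1)  # (T, J, 2C)
    alpha = softmax(both @ weight + bias, axis=-1)             # (T, J, 2)
    # alpha[..., 0] weights the local stream, alpha[..., 1] the global one.
    return alpha[..., :1] * local_feat + alpha[..., 1:] * global_feat

rng = np.random.default_rng(0)
T, J, C = 27, 17, 8  # frames, joints, channels (example sizes only)
local_feat = rng.normal(size=(T, J, C))
global_feat = rng.normal(size=(T, J, C))
weight = rng.normal(size=(2 * C, 2)) * 0.1
bias = np.zeros(2)

fused = adaptive_fusion(local_feat, global_feat, weight, bias)
print(fused.shape)
```

Because the two weights sum to one for every frame and joint, each fused value lies between the corresponding local and global values. With all-zero parameters the sketch reduces to a plain average of the two streams.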

Abstract

Transformer-based methods have recently achieved significant success in 3D human pose estimation, owing to their strong ability to model long-range dependencies. However, relying solely on the global attention mechanism is insufficient for capturing the fine-grained local details that are crucial for accurate pose estimation. To address this, we propose SSR-STF, a dual-stream model that effectively integrates local features with global dependencies to enhance 3D human pose estimation. Specifically, we introduce SSRFormer, a simple yet effective module that employs the skeleton selective refine attention (SSRA) mechanism to capture fine-grained local dependencies in human pose sequences, complementing the global dependencies modeled by the Transformer. By adaptively fusing these two feature streams, SSR-STF can better learn the underlying structure of human poses, overcoming the limitations of traditional methods in local feature extraction. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that SSR-STF achieves state-of-the-art performance, with P1 errors of 37.4 mm and 13.2 mm, respectively, outperforming existing methods in both accuracy and generalization. Furthermore, the motion representations learned by our model prove effective in downstream tasks such as human mesh recovery. Code is available at https://github.com/poker-xu/SSR-STF.
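As a reminder of what the P1 error measures, the following NumPy sketch computes MPJPE (mean per-joint position error): the Euclidean distance between predicted and ground-truth joints, averaged over joints and frames. The function names and the choice of joint index 0 as the root are assumptions for illustration and are not taken from the authors' code. Aligning both poses at the root joint before measuring is a common convention for this protocol.

```python
import numpy as np

def root_align(pose, root=0):
    # Express every joint relative to the root joint (index 0 is assumed
    # to be the pelvis here, which is a convention, not a given).
    return pose - pose[..., root:root + 1, :]

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., J, 3), in millimetres.
    Returns the mean Euclidean distance over all joints and frames.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: 2 frames, 17 joints, prediction offset by (3, 4, 0) mm.
gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])

raw_error = mpjpe(pred, gt)                              # constant offset counts
aligned_error = mpjpe(root_align(pred), root_align(gt))  # offset cancels
print(raw_error, aligned_error)
```

In the toy example the prediction differs from the ground truth only by a global translation, so the error disappears once both poses are root-aligned.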

Paper Structure (22 sections, 13 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: MPJPE comparison on the Human3.6M dataset [ionescu2013human3] (lower is better). The horizontal and vertical axes show MPJPE with detected 2D poses and ground-truth (GT) 2D poses as input, respectively. Using GT 2D poses as input measures the performance upper bound of 2D-to-3D lifting models. The number of parameters of each method is also shown. Our model balances accuracy and parameter efficiency, achieving a new SOTA.
  • Figure 2: $(a)$ The overall architecture of SSR-STF, which consists of $N$ dual-stream spatio-temporal blocks; one stream uses SSRFormers and the other uses STFormers. $(b)$ Network structure of the Spatial/Temporal SSRFormer. SSRFormer employs the skeleton selective refine attention (SSRA) mechanism to capture the local spatio-temporal features of skeleton sequences. $(c)$ Network structure of the STFormer. STFormer adopts the self-attention mechanism, which excels at capturing global dependencies.
  • Figure 3: SSRFormer. We employ the skeleton selective refine attention mechanism to extract the spatio-temporal local features of 2D joints, as illustrated by the SSRFormer with a kernel size of $k_{1} \times k_{2}$.
  • Figure 4: The MPJPE distribution on the Human3.6M test set, with estimated 2D poses as input and $T=27$. The horizontal axis represents the error interval, while the vertical axis shows the proportion of poses within each error interval.
  • Figure 5: Qualitative comparisons of 3D pose estimation by MixSTE [zhang2022mixste], MotionBERT [zhu2023motionbert], MotionAGFormer [mehraban2024motionagformer], and our SSR-STF. The gray skeleton is the ground-truth 3D pose. Blue, orange, and green skeletons indicate the left part, right part, and torso of the estimated body, respectively.
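
The error distribution described for Figure 4 (proportion of poses per error interval) can be computed from per-pose errors with a simple histogram. The sketch below is only an illustration: the function name `error_distribution`, the error values, and the interval edges are made up for the example and are not results from the paper.

```python
import numpy as np

def error_distribution(per_pose_err, edges):
    """Proportion of poses whose error falls in each interval.

    per_pose_err: 1D array of per-pose MPJPE values in millimetres.
    edges: increasing interval boundaries, e.g. [0, 20, 40, 60, 80].
    Errors outside [edges[0], edges[-1]] are not counted in any interval.
    """
    per_pose_err = np.asarray(per_pose_err, dtype=float)
    counts, _ = np.histogram(per_pose_err, bins=edges)
    return counts / per_pose_err.size

# Hypothetical per-pose errors, for illustration only.
errors = np.array([10.0, 25.0, 35.0, 35.0, 70.0])
edges = [0, 20, 40, 60, 80]

proportions = error_distribution(errors, edges)
for lo, hi, p in zip(edges[:-1], edges[1:], proportions):
    print(f"{lo:>3}-{hi:<3} mm: {p:.2f}")
```

Dividing by the total number of poses rather than by the number of counted poses keeps the proportions comparable across methods, even when some errors fall outside the plotted range.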