Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation

Ruochen Li, Shuang Chen, Wenke E, Farshad Arvin, Amir Atapour-Abarghouei

Abstract

Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often struggle with efficiency and adaptability, particularly when they rely on dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.
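To make the adaptive multi-scale idea concrete, below is a minimal PyTorch sketch of input-dependent temporal scale weighting in the spirit of AMTM. The module name, the depthwise-convolution branches, and the softmax gating head are illustrative assumptions rather than the authors' implementation; the three window sizes follow the short/medium/long scales reported for Figure 5.

```python
# Illustrative sketch only: an input-adaptive mix of short/medium/long
# temporal windows, loosely in the spirit of AMTM. Names and the gating
# design are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class MultiScaleTemporalSketch(nn.Module):
    """Blends temporal features from three window sizes with per-joint,
    per-frame scale weights (cf. the scale distributions in Figure 5)."""

    def __init__(self, channels: int, kernels=(9, 27, 81)):
        super().__init__()
        # One depthwise temporal convolution per scale; odd kernels with
        # symmetric padding preserve the sequence length T.
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        )
        # Gating head: predicts a softmax weight over scales for every
        # (frame, joint) position, making the mixing motion-adaptive.
        self.gate = nn.Linear(channels, len(kernels))

    def forward(self, x):  # x: (B, T, J, C)
        b, t, j, c = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * j, c, t)        # (B*J, C, T)
        feats = torch.stack([br(seq) for br in self.branches])  # (S, B*J, C, T)
        feats = feats.permute(1, 3, 0, 2)                       # (B*J, T, S, C)
        w = self.gate(x).softmax(dim=-1)                        # (B, T, J, S)
        w = w.permute(0, 2, 1, 3).reshape(b * j, t, -1, 1)      # (B*J, T, S, 1)
        out = (w * feats).sum(dim=2)                            # (B*J, T, C)
        return out.reshape(b, j, t, c).permute(0, 2, 1, 3)      # (B, T, J, C)


# Quick shape check on a 243-frame, 17-joint sequence.
x = torch.randn(2, 243, 17, 64)
print(MultiScaleTemporalSketch(64)(x).shape)  # torch.Size([2, 243, 17, 64])
```

A per-position softmax over scales keeps the mixture lightweight (a single linear head) while still letting fast-moving joints such as wrists and ankles favour short windows and stable torso joints favour long ones, matching the behaviour described in the Figure 5 caption below.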

Figures (5)

  • Figure 1: Comparison of temporal dependency modelling strategies and efficiency–accuracy trade-offs. Left: (a) full-sequence, (b) fixed-scale, and (c) the proposed multi-scale temporal modelling. Right: Comparison of recent methods on Human3.6M in terms of MPJPE (lower is better) versus MACs/frame, showing that our method achieves state-of-the-art performance with low computational cost.
  • Figure 2: Overview of the proposed framework. The model integrates (a) a Skeleton-constrained Adaptive GCN (SAGCN) for spatial modelling and (b) an Adaptive Multi-scale Temporal Modelling (AMTM) module for temporal modelling. STGC denotes the sparse temporal graph convolution operation. (A minimal illustrative sketch of the SAGCN idea appears after this list.)
  • Figure 3: Qualitative comparisons of our method with MotionAGFormer and TCPFormer on in-the-wild videos. We highlight inaccurate or ambiguous 2D detections with light-yellow arrows and indicate the corresponding deviations in the reconstructed 3D poses using orange arrows.
  • Figure 4: Qualitative comparisons between our method and TCPFormer on the Human3.6M dataset for the Sitting and Walk actions. Black dashed circles indicate highlighted regions.
  • Figure 5: Visualisation of the scale weight distribution for the Walk and SittingDown actions on the Human3.6M dataset (243 frames). Body joints are grouped into lower body (hip, knee, ankle), torso (root, spine, thorax, neck, head), and upper body (shoulder, elbow, wrist). Bars represent the average selection weights assigned to three temporal scales: short (9 frames), medium (27 frames), and long (81 frames).
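For the spatial side, the following is a minimal sketch of a skeleton-constrained adaptive graph convolution in the spirit of SAGCN, as referenced in the Figure 2 caption above. The bone list assumes the common 17-joint Human3.6M layout, and the masked learnable edge weights are an illustrative design choice, not the paper's exact formulation.

```python
# Illustrative sketch only: graph convolution whose learnable edge weights
# are constrained to the skeleton topology (plus self-loops), loosely in
# the spirit of SAGCN. The bone list assumes the standard Human3.6M
# 17-joint ordering and may differ from the paper's.
import torch
import torch.nn as nn

BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8),
         (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (8, 14), (14, 15),
         (15, 16)]


class SkeletonConstrainedGCNSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, num_joints: int = 17):
        super().__init__()
        # Binary mask: joints may only attend to themselves and bone
        # neighbours, keeping spatial aggregation sparse and anatomical.
        mask = torch.eye(num_joints)
        for i, j in BONES:
            mask[i, j] = mask[j, i] = 1.0
        self.register_buffer("mask", mask)
        # Learnable edge weights, restricted to skeleton edges by the mask,
        # so each joint learns its own neighbour weighting ("joint-specific").
        self.edge = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):  # x: (B, T, J, C_in) -> (B, T, J, C_out)
        # Softmax over permitted edges yields a row-normalised adjacency.
        adj = self.edge.masked_fill(self.mask == 0, float("-inf")).softmax(-1)
        return torch.einsum("ij,btjc->btic", adj, self.proj(x))


x = torch.randn(2, 243, 17, 64)
print(SkeletonConstrainedGCNSketch(64, 64)(x).shape)  # torch.Size([2, 243, 17, 64])
```

Because aggregation touches only bone-connected joint pairs, the per-layer cost scales with the number of skeleton edges rather than with dense joint-to-joint attention, which is consistent with the efficiency argument made in the abstract.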