Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers
Jianbin Jiao, Xina Cheng, Weijie Chen, Xiaoting Yin, Hao Shi, Kailun Yang
TL;DR
This work tackles 3D human pose estimation from multi-view video by introducing a two-branch transformer-based framework that separately captures intra-frame spatial features (Spatial Module) and inter-frame temporal and 3D spatial relations (Image Relations Module). By aggregating frame-level information into compact tokens and applying both windowed and global self-attention, the approach efficiently models long-range dependencies and occlusion-robust cues, achieving state-of-the-art results on Human3.6M. The method improves 2D pose accuracy and, when combined with PoseFormer for 3D reconstruction, yields notable reductions in MPJPE and P-MPJPE; longer frame sequences improve performance further. These results demonstrate the practicality of multi-perspective spatial-temporal relational transformers for precise 3D pose estimation in video data, with potential for real-time applications after further optimization.
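To make the difference between windowed and global self-attention over frame tokens concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a single attention head, identity query/key/value projections, non-overlapping windows, and hypothetical sizes (8 frame tokens of dimension 4, window size 4); none of these details are given in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, mask=None):
    # Scaled dot-product self-attention with identity Q/K/V projections
    # (an assumption made to keep the sketch short).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ tokens

def window_mask(n_tokens, window):
    # True where two tokens fall in the same non-overlapping window.
    idx = np.arange(n_tokens) // window
    return idx[:, None] == idx[None, :]

# Hypothetical input: one token per frame, 8 frames, 4 features each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))

global_out = self_attention(tokens)                        # every frame attends to all frames
windowed_out = self_attention(tokens, window_mask(8, 4))   # frames attend within their window only
```

In this sketch the windowed variant only computes useful scores inside each window, which is the usual motivation for windowing on long sequences, while the global variant lets every frame token see every other one.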
Abstract
3D human pose estimation locates human joints in three-dimensional space while preserving depth information and physical structure. This is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Because data collection is challenging, mainstream 3D human pose estimation datasets consist primarily of multi-view video captured in laboratory environments, which contains rich spatial-temporal correlation information beyond the content of individual frames. Since the self-attention mechanism of transformers can capture such spatial-temporal correlations in multi-view video, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose estimation. First, the spatial module represents human pose features from intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationships between the multi-perspective images. Second, the self-attention mechanism is adopted to eliminate interference from non-human body parts and to reduce the computing resources required. We evaluate our method on Human3.6M, a popular 3D human pose estimation dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose.
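The TL;DR reports results as MPJPE and P-MPJPE. As a reference for how these metrics are commonly defined, here is a minimal NumPy sketch: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and P-MPJPE is the same error after a similarity (Procrustes) alignment of each predicted pose to the ground truth. The array layout `(frames, joints, 3)` and the function names are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error: average Euclidean distance over all joints and frames.
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def p_mpjpe(pred, gt):
    # MPJPE after aligning each predicted pose to the ground truth with the
    # best-fit scale, rotation, and translation (Procrustes analysis).
    mu_p = pred.mean(axis=1, keepdims=True)
    mu_g = gt.mean(axis=1, keepdims=True)
    p0, g0 = pred - mu_p, gt - mu_g

    # Per-frame 3x3 cross-covariance H = p0^T g0 and its SVD.
    H = np.einsum("fji,fjk->fik", p0, g0)
    U, S, Vt = np.linalg.svd(H)

    # Avoid reflections: flip the last singular direction where det(R) < 0.
    sign = np.sign(np.linalg.det(U @ Vt))
    U[:, :, -1] *= sign[:, None]
    S[:, -1] *= sign
    R = U @ Vt  # rotation applied to row vectors: p0 @ R

    # Optimal isotropic scale for the aligned pose.
    scale = S.sum(axis=1) / (p0 ** 2).sum(axis=(1, 2))
    aligned = scale[:, None, None] * (p0 @ R) + mu_g
    return mpjpe(aligned, gt)

# Hypothetical data: 2 frames, 17 joints.
rng = np.random.default_rng(0)
gt = rng.normal(size=(2, 17, 3))
shifted = gt + np.array([3.0, 4.0, 0.0])  # every joint is displaced by the same offset

# A rotated, scaled, and translated copy of the ground truth.
theta = 0.5
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
transformed = 1.5 * gt @ Rz + np.array([1.0, 2.0, 3.0])
```

Because P-MPJPE removes global scale, rotation, and translation, it isolates errors in the pose's shape, whereas MPJPE also penalizes global misplacement.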
