Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

Jianbin Jiao; Xina Cheng; Weijie Chen; Xiaoting Yin; Hao Shi; Kailun Yang

TL;DR

This work tackles 3D human pose estimation from multi-view video with a two-branch transformer framework: a Spatial Module captures intra-frame pose features, and an Image Relations Module captures inter-frame temporal relations and 3D spatial relations between views. By aggregating frame-level information into compact tokens and applying both windowed and global self-attention, the approach models long-range dependencies and occlusion-robust cues efficiently, achieving state-of-the-art results on Human3.6M. The method improves 2D pose accuracy and, when combined with PoseFormer for 3D reconstruction, reduces both MPJPE and P-MPJPE; longer frame sequences improve performance further. These results support the practicality of multi-perspective spatial-temporal relational transformers for precise 3D pose estimation in video, with potential for real-time use after further optimization.
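For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions; P-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth (Procrustes alignment). The following is a minimal sketch of plain MPJPE only. The function name `mpjpe` and the toy coordinates are illustrative assumptions, not code or numbers from the paper.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, target: lists of (x, y, z) joint coordinates in the same units
    (millimetres on Human3.6M) and in the same joint order.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Toy two-joint pose (hypothetical numbers, not results from the paper).
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
error_mm = mpjpe(pred, target)  # per-joint errors are 5 and 12, so the mean is 8.5
```

Lower values are better for both metrics; P-MPJPE removes global rotation, translation, and scale differences, so it isolates errors in the pose itself.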

Abstract

3D human pose estimation locates human joints in three-dimensional space while preserving depth information and physical structure. This is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Because data collection is difficult, mainstream 3D human pose estimation datasets consist primarily of multi-view video collected in laboratory environments, which carries rich spatial-temporal correlation information beyond the content of individual frames. Since the self-attention mechanism of transformers can capture such spatial-temporal correlations from multi-view video, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection. First, the spatial module represents human pose features from intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationships between the multi-perspective images. Second, the self-attention mechanism is used to suppress interference from non-human body parts and to reduce computational cost. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset, where experimental results show that it achieves state-of-the-art performance. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose.

Paper Structure (17 sections, 14 equations, 4 figures, 3 tables)

Introduction
Related Work
2D Human Pose Detection
3D Human Pose Estimation
Vision Transformer
Method
Overview Architecture
Motivation
Spatial Module
Image Relations Module
Experiments
Datasets
Implementation Details
Evaluation of 2D Human Body Detection Results
Evaluation of 3D Human Pose Detection Results
...and 2 more sections

Figures (4)

Figure 1: Conceptual diagram of our approach, which consists of two main components: the Spatial Module and the Image Relations Module. The Spatial Module extracts human pose features from the content of each image, while the Image Relations Module first models the temporal relationships between frames and then models the spatial positional relationships between corresponding images in 3D space.
Figure 2: The proposed network architecture for extracting 2D human poses with self-attention, comprising two modules. The Spatial Module extracts pose features from images with windowed self-attention. The Image Relations Module extracts temporal relationships and 3D spatial features from video frames, using global self-attention to learn sequence-wide relationships. The final output is the 2D human pose, which is then used to estimate the 3D pose.
Figure 3: (a) Mobile Window Attention Module: this module restricts self-attention to a small window. To capture global features, the window moves across the image, and self-attention is computed within each image block produced by each move. (b) Standard Transformer Module.
Figure 4: Detection results of our method, evaluated on all images in the Human3.6M dataset. Our approach generates 2D poses for these images, which are then fed to the 3D pose network PoseFormer (Zheng et al., 2021) to produce the corresponding 3D poses.
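
The window-limited attention described in Figure 3(a) can be illustrated by counting how many query-key pairs it computes compared with global attention. The sketch below is an assumption-laden toy, not the paper's implementation: it models the moving window as a cyclic shift of the token grid (in the style of shifted-window transformers), omits the masking such designs usually apply at wrapped borders, and the names `window_partition` and `attention_pairs` are hypothetical.

```python
def window_partition(height, width, window, shift=0):
    """Group token positions of a height x width grid into window x window blocks.

    With shift > 0 the grid is cyclically rolled first (an assumed scheme),
    so tokens near window borders share a window with different neighbours.
    """
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = {}
    for r in range(height):
        for c in range(width):
            rr, cc = (r - shift) % height, (c - shift) % width
            key = (rr // window, cc // window)
            windows.setdefault(key, []).append((r, c))
    return list(windows.values())

def attention_pairs(groups):
    """Number of query-key pairs when attention is restricted to each group."""
    return sum(len(g) ** 2 for g in groups)

H, W, M = 8, 8, 4                       # toy token grid and window size
plain = window_partition(H, W, M)
shifted = window_partition(H, W, M, shift=M // 2)
global_pairs = (H * W) ** 2             # every token attends to every token
windowed_pairs = attention_pairs(plain) # attention only inside each window
```

On this 8 x 8 grid, global attention evaluates 4096 pairs while 4 x 4 windows evaluate 1024. Tokens (3, 3) and (4, 4) sit in different windows in the plain partition but share one after the shift, which is how information can cross window borders over successive layers.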
