RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

Tao Jiang; Xinchen Xie; Yining Li

TL;DR

RTMW targets real-time, multi-person whole-body pose estimation. It extends the RTMPose baseline with PAFPN and a Hierarchical Encoding Module (HEM) to preserve fine-grained detail across body parts of different scales, and uses SimCC coordinate classification to predict 2D keypoints on the 133-keypoint COCO-Wholebody schema. Training combines joint multi-dataset supervision with a two-stage distillation strategy. RTMW3D extends the approach to monocular 3D pose estimation by adding a z-axis branch and a root-relative z-coordinate scheme, achieving competitive results on H3WB alongside strong 2D performance on COCO-Wholebody. Experiments show state-of-the-art or near state-of-the-art accuracy at real-time inference speeds, and the code and models are open source. RTMW and RTMW3D are intended as deployable solutions for industry and research applications such as video analysis and synthetic content generation.
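To make the coordinate-classification idea concrete, the following is a minimal sketch of how SimCC-style outputs can be decoded into 2D keypoints. It is not the RTMW implementation. The function name `decode_simcc`, the default `split_ratio` of 2.0, and the choice of taking the smaller of the two peak values as a confidence score are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def decode_simcc(
    x_scores: Sequence[Sequence[float]],
    y_scores: Sequence[Sequence[float]],
    split_ratio: float = 2.0,  # assumed number of bins per pixel
) -> List[Tuple[float, float, float]]:
    """Decode per-keypoint 1D score vectors into (x, y, score) triples.

    x_scores[k] holds the scores of keypoint k over horizontal bins,
    y_scores[k] the scores over vertical bins. Each axis is treated as an
    independent classification problem, which is the core of SimCC.
    """
    keypoints = []
    for xs, ys in zip(x_scores, y_scores):
        # Pick the most likely bin on each axis.
        x_bin = max(range(len(xs)), key=xs.__getitem__)
        y_bin = max(range(len(ys)), key=ys.__getitem__)
        # Assumption: use the weaker of the two peaks as the confidence.
        score = min(xs[x_bin], ys[y_bin])
        # Convert bin indices back to pixel coordinates.
        keypoints.append((x_bin / split_ratio, y_bin / split_ratio, score))
    return keypoints


# One keypoint, 4 horizontal bins and 6 vertical bins.
example = decode_simcc([[0.1, 0.2, 0.9, 0.3]], [[0.0, 0.8, 0.1, 0.1, 0.2, 0.0]])
```

With more than one bin per pixel, the decoded coordinates can fall between pixels, so the classification output is not limited to integer pixel positions.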

Abstract

Whole-body pose estimation is a challenging task that requires simultaneous prediction of fine-grained keypoints for the face, torso, hands, and feet. It plays an important role in the study of human-centric perception and generation and in various applications. In this work, we present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation. We combine the RTMPose model architecture with FPN and HEM (Hierarchical Encoding Module) to better capture pose information from body parts at different scales. The model is trained on a rich collection of open-source human keypoint datasets with manually aligned annotations and is further enhanced via a two-stage distillation strategy. RTMW demonstrates strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency and deployment friendliness. We release three sizes, m/l/x, with RTMW-l achieving 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. We also explore the performance of RTMW on 3D whole-body pose estimation, performing image-based monocular 3D whole-body pose estimation in a coordinate classification manner. We hope this work can benefit both academic research and industrial applications. The code and models are publicly available at: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmpose
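As an illustration of what predicting depth "in a coordinate classification manner" could look like, the sketch below maps a root-relative depth to a class index and back. It is a hypothetical example under stated assumptions, not the RTMW3D code. The helper names `z_to_bin` and `bin_to_z`, the symmetric depth range `z_range`, and the bin count `num_bins` are all invented for illustration.

```python
def z_to_bin(z: float, z_root: float, z_range: float = 1.0, num_bins: int = 101) -> int:
    """Map a keypoint depth to a class index over root-relative depth.

    Assumption: depths are expressed relative to a root joint and lie in
    [-z_range, +z_range]; values outside that interval are clamped.
    """
    z_rel = z - z_root
    z_rel = max(-z_range, min(z_range, z_rel))
    # Linearly rescale [-z_range, +z_range] to [0, num_bins - 1].
    return round((z_rel + z_range) / (2 * z_range) * (num_bins - 1))


def bin_to_z(index: int, z_root: float, z_range: float = 1.0, num_bins: int = 101) -> float:
    """Invert z_to_bin up to quantization error."""
    z_rel = index / (num_bins - 1) * 2 * z_range - z_range
    return z_rel + z_root


# A keypoint at the same depth as the root lands in the middle bin.
middle = z_to_bin(2.0, z_root=2.0)
restored = bin_to_z(middle, z_root=2.0)
```

Expressing depth relative to a root joint keeps the target range bounded regardless of how far the person stands from the camera, which is what makes a fixed set of bins workable in this sketch.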

Paper Structure (25 sections, 4 figures, 4 tables)

Introduction
Related Work
Top-down Approaches.
Coordinate Classification.
3D pose estimation
Model Architecture and Training
RTMW
Task Limitation
Model Architecture
PAFPN
HEM (Hierarchical Encoding Module)
Training Techniques
RTMW3D
Task definition
Data process
...and 10 more sections

Figures (4)

Figure 1: The RTMW architecture.
Figure 2: The task definition of 3D pose estimation.
Figure 3: 2D visualization results
Figure 4: RTMW3D inference results (The 2D and 3D keypoints are both obtained from the RTMW3D model after a single inference.)
