RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation
Tao Jiang, Xinchen Xie, Yining Li
TL;DR
RTMW targets real-time, multi-person whole-body pose estimation by extending the RTMPose baseline with PAFPN and a Hierarchical Encoding Module (HEM) to preserve fine-grained detail across body parts. It uses SimCC coordinate classification to predict 2D keypoints under the 133-keypoint COCO-Wholebody schema, and is trained with joint multi-dataset supervision followed by a two-stage distillation strategy. The approach is further extended to monocular 3D pose estimation (RTMW3D) by adding a z-axis branch and a root-relative z-coordinate scheme, achieving competitive results on H3WB alongside strong 2D performance on COCO-Wholebody. Experiments show state-of-the-art or near state-of-the-art accuracy at real-time inference speeds, and the code and models are open-sourced for practical deployment in applications such as video analysis and synthetic content generation.
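To make the coordinate-classification idea concrete, here is a minimal sketch of how SimCC-style outputs could be decoded into pixel coordinates. It is an illustration, not the RTMW implementation: the function names, the `split_ratio` default of 2.0 (bins per pixel), and the use of the lower of the two per-axis peaks as a confidence score are all assumptions made for this example. A z-axis branch like the one in RTMW3D would presumably be decoded with the same argmax-over-bins step on a third score vector.

```python
from typing import List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest entry in a 1D score vector."""
    return max(range(len(values)), key=values.__getitem__)


def decode_simcc(
    simcc_x: Sequence[Sequence[float]],
    simcc_y: Sequence[Sequence[float]],
    split_ratio: float = 2.0,  # assumed number of bins per input pixel
) -> List[Tuple[float, float, float]]:
    """Convert per-axis bin scores into (x, y, score) for each keypoint.

    simcc_x[k] holds scores over horizontal bins for keypoint k,
    simcc_y[k] holds scores over vertical bins for keypoint k.
    """
    keypoints = []
    for row_x, row_y in zip(simcc_x, simcc_y):
        # Each axis is treated as an independent classification problem.
        bin_x, bin_y = argmax(row_x), argmax(row_y)
        # Assumed confidence: the weaker of the two per-axis peaks.
        score = min(row_x[bin_x], row_y[bin_y])
        # Dividing by split_ratio maps bin indices back to pixel units.
        keypoints.append((bin_x / split_ratio, bin_y / split_ratio, score))
    return keypoints


# Toy example: one keypoint on an 8x6 pixel input (16 and 12 bins).
row_x = [0.0] * 16
row_x[5] = 0.9
row_y = [0.0] * 12
row_y[8] = 0.7
print(decode_simcc([row_x], [row_y]))  # [(2.5, 4.0, 0.7)]
```

Because bins can be finer than pixels, the decoded coordinates have sub-pixel resolution without requiring a high-resolution 2D heatmap, which is one reason this formulation suits real-time models.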
Abstract
Whole-body pose estimation is a challenging task that requires simultaneous prediction of fine-grained keypoints for the body, hands, face, and feet. It plays an important role in the study of human-centric perception and generation and in various applications. In this work, we present RTMW (Real-Time Multi-person Whole-body pose estimation models), a series of high-performance models for 2D/3D whole-body pose estimation. We combine the RTMPose model architecture with FPN and HEM (Hierarchical Encoding Module) to better capture pose information from body parts at different scales. The models are trained on a rich collection of open-source human keypoint datasets with manually aligned annotations and further enhanced via a two-stage distillation strategy. RTMW demonstrates strong performance on multiple whole-body pose estimation benchmarks while maintaining high inference efficiency and deployment friendliness. We release models in three sizes (m/l/x); RTMW-l achieves 70.2 mAP on the COCO-Wholebody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark. We also explore RTMW for 3D whole-body pose estimation, performing image-based monocular 3D whole-body pose estimation in a coordinate classification manner. We hope this work can benefit both academic research and industrial applications. The code and models are publicly available at: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmpose
