RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

Peng Lu, Tao Jiang, Yining Li, Xiangtai Li, Kai Chen, Wenming Yang

TL;DR

RTMO addresses real-time multi-person pose estimation by integrating coordinate classification into a YOLO-based one-stage framework. It introduces a Dynamic Coordinate Classifier that uses localized, dynamically allocated 1-D heatmap bins and a sinusoidal bin-encoding scheme, paired with a Maximum Likelihood Estimation loss that learns per-sample uncertainty. RTMO-l reaches 74.8% AP on COCO val2017 at 141 FPS on a single V100 GPU, which the authors report as state of the art among real-time one-stage estimators, and it also performs strongly on CrowdPose. The results indicate that dual 1-D heatmaps with adaptive binning can bring a one-stage model, in a single forward pass, to accuracy comparable to top-down methods, making it a practical option for real-time pose analytics.

Abstract

Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.
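To make the idea of dual 1-D heatmaps concrete, the following is a minimal sketch of how one keypoint could be decoded from a pair of 1-D heatmaps whose bins are laid out over a predicted bounding box. It is an illustration under stated assumptions, not the authors' implementation: the function names, the endpoint-inclusive uniform bin spacing, and the soft-argmax (expectation) decoding are all assumptions, and the 1.25 expansion factor is taken from the Figure 2 caption.

```python
import math


def bin_centers(lo, hi, num_bins):
    """Uniformly spaced bin positions covering [lo, hi], endpoints included.

    Assumption: the spacing rule is illustrative; the paper's exact bin
    placement may differ.
    """
    step = (hi - lo) / (num_bins - 1)
    return [lo + i * step for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_keypoint(box, x_logits, y_logits, expand=1.25):
    """Decode one keypoint from a pair of 1-D heatmaps.

    box: (cx, cy, w, h) of the predicted bounding box.
    x_logits, y_logits: unnormalized scores, one per horizontal/vertical bin.
    expand: the bins span a region this many times the box size.
    Returns (x, y) as the probability-weighted mean of the bin positions
    (a soft-argmax; hypothetical choice of decoder).
    """
    cx, cy, w, h = box
    half_w, half_h = 0.5 * expand * w, 0.5 * expand * h
    xs = bin_centers(cx - half_w, cx + half_w, len(x_logits))
    ys = bin_centers(cy - half_h, cy + half_h, len(y_logits))
    px, py = softmax(x_logits), softmax(y_logits)
    x = sum(p * c for p, c in zip(px, xs))
    y = sum(p * c for p, c in zip(py, ys))
    return x, y
```

Because the bin positions are recomputed from each predicted box, the same number of bins covers a small person finely and a large person coarsely, which is the sense in which the bins are allocated dynamically rather than fixed over the whole image.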

Paper Structure (27 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Efficiency and efficacy comparison among real-time pose estimation methods across different inference backends and devices. The radial axes indicate inference speed in Frames Per Second (FPS). The outer circular axis shows Average Precision (AP) on the COCO val2017 dataset. Models marked with $\dagger$ were trained with additional data beyond the COCO train2017.
  • Figure 2: Overview of the RTMO Network Architecture. Its head outputs predictions for the score, bounding box, keypoint coordinates and visibility for each grid cell. The Dynamic Coordinate Classifier translates pose features into K pairs of 1-D heatmaps for both the horizontal and vertical axes, encompassing an expanded region 1.25 times the size of the predicted bounding boxes. From these heatmaps, keypoint coordinates are precisely extracted. K denotes the total number of keypoints.
  • Figure 3: Comparison of RTMO with other real-time multi-person pose estimators. The latency for top-down methods varies depending on the number of instances in the image, as indicated by numerical values in the figures. All models are evaluated without test-time augmentation. $\dagger$ indicates that the model was trained using additional data beyond the COCO train2017 dataset.
  • Figure 4: Visualization of estimated human pose (top) and corresponding heatmaps (bottom).
  • Figure 5: Visualization of (left) OKS showing sample difficulty and (right) learned variance in MLE loss. Red crosses mark the position of grids.
  • ...and 1 more figure
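The AP figures quoted above and the sample-difficulty map in Figure 5 are both based on Object Keypoint Similarity (OKS). As a reminder of what that quantity measures, here is a minimal sketch of the standard COCO-style OKS for a single person. The function name and argument layout are hypothetical, and the per-keypoint constants in the usage example are placeholder values, not the official COCO constants.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity for one person instance.

    pred, gt: lists of (x, y) keypoint coordinates in pixels.
    visible:  list of bools; only labeled keypoints contribute.
    area:     object area in pixels^2 (plays the role of the scale s^2).
    k:        per-keypoint falloff constants.
    Each labeled keypoint scores exp(-d^2 / (2 * s^2 * k_i^2)); the OKS is
    the mean of those scores.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if not v:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


# Usage with placeholder constants: one keypoint is off by (1, 1) pixels.
gt = [(10.0, 10.0), (20.0, 20.0)]
pred = [(11.0, 11.0), (20.0, 20.0)]
score = oks(pred, gt, [True, True], area=100.0, k=[0.1, 0.1])
```

An OKS near 1 means the predicted pose closely matches the annotation at the object's scale; lower values mark harder or poorly localized samples.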