ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation

Nanjun Li, Pinqi Cheng, Zean Liu, Minghe Tian, Xuanyin Wang

TL;DR

This work revisits box-driven single-stage pose estimation from a keypoint-driven perspective and identifies semantic conflicts among parallel objectives as a key source of performance degradation. It then proposes a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective.

Abstract

Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. Compared with the baseline YOLO-Pose, ER-Pose-n improves AP on MS COCO and CrowdPose, respectively, by 3.2/6.7 without pre-training and by 7.4/4.9 with pre-training. These improvements are achieved with fewer parameters and higher inference efficiency.
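
The abstract refers to pose evaluation metrics and to an OKS-based loss. For orientation, the standard COCO object keypoint similarity (OKS) can be sketched as below. This is the usual evaluation-time definition, not the smooth OKS loss proposed in the paper, whose exact form is not given in this section. The function name `oks` and its argument layout are assumptions made for illustration; the sigma values are the per-keypoint constants used by the COCO keypoint evaluation.

```python
import math

# Per-keypoint sigmas for the 17 COCO keypoints (COCO keypoint evaluation constants).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS, eps=1e-9):
    """Standard COCO OKS between one predicted and one ground-truth pose.

    pred, gt:    lists of (x, y) keypoint coordinates, same order and length.
    visibility:  list of ints; keypoints with v <= 0 are unlabeled and skipped.
    area:        ground-truth object area, used as the squared scale s^2.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # only labeled keypoints contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2  # squared pixel distance
        k = 2.0 * sigma                       # per-keypoint falloff constant
        total += math.exp(-d2 / (2.0 * (area + eps) * k * k))
        count += 1
    return total / count if count else 0.0


# Usage: a perfect prediction scores 1.0; shifting every keypoint lowers the score.
gt_pose = [(10.0 * i, 5.0 * i) for i in range(17)]
shifted = [(x + 3.0, y) for x, y in gt_pose]
labels = [2] * 17
perfect_score = oks(gt_pose, gt_pose, labels, area=1000.0)
shifted_score = oks(shifted, gt_pose, labels, area=1000.0)
```

Because OKS normalizes the keypoint error by object scale and a per-keypoint tolerance, it is the quantity behind COCO keypoint AP, which is presumably why an assignment strategy and loss built on it align training more closely with evaluation than box IoU does.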

Paper Structure

This paper contains 19 sections, 12 equations, 15 figures, 11 tables, 1 algorithm.

Figures (15)

  • Figure 1: Performance comparison between ER-Pose and representative human pose estimation methods on the COCO test-dev set. (a) Accuracy vs. inference time. (b) Accuracy vs. number of parameters. ER-Pose demonstrates superior inference efficiency while maintaining competitive accuracy.
  • Figure 2: Illustration of the typical box-driven pipeline: the bounding-box-centered single-stage pose estimation paradigm.
  • Figure 3: Overview of the ER-Pose framework. ER-Pose adopts a keypoint-driven single-stage architecture. The feature extraction design follows YOLOv8, with CSPDarkNet [wang2020cspnet, bochkovskiy2020yolov4] as the backbone and PAN [liu2018path] as the neck to produce multi-scale features. The head consists of a human confidence branch and a pose branch. The confidence branch adopts DSConv [howard2017mobilenets] to reduce computational cost, while keypoint regression employs standard convolution to preserve sufficient representational capacity. The network is trained under the MAH-SAH scheme.
  • Figure 4: Suboptimal keypoint feature selection induced by box-driven modeling. The upper part shows positive samples selected by the network, while the lower part presents samples from the same network that yield more accurate keypoint estimates but are suppressed during inference.
  • Figure 5: The assignment paradigm
  • ...and 10 more figures
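
The Figure 3 caption notes that the confidence branch uses depthwise separable convolution (DSConv) to reduce cost while the pose branch keeps standard convolution. A rough weight count illustrates the size of that trade-off. The channel widths and kernel size below are hypothetical, chosen only for the arithmetic, and biases and normalization layers are ignored; none of these numbers come from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels with a 3 x 3 kernel.
std = standard_conv_params(64, 64, 3)        # 36864
dsc = depthwise_separable_params(64, 64, 3)  # 4672
ratio = std / dsc                            # roughly 7.9x fewer weights
```

For this hypothetical layer the separable form needs about one eighth of the weights, which is consistent with using it on the lightweight confidence branch and reserving full convolutions for keypoint regression.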