ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation

Nanjun Li; Pinqi Cheng; Zean Liu; Minghe Tian; Xuanyin Wang

TL;DR

This work revisits box-driven single-stage pose estimation from a keypoint-driven perspective and identifies semantic conflicts among parallel objectives as a key source of performance degradation. It then proposes a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective.

Abstract

Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations required for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. Compared with the baseline YOLO-Pose, ER-Pose-n achieves AP improvements on MS COCO and CrowdPose of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training, respectively. These improvements are achieved with fewer parameters and higher inference efficiency.
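
For readers unfamiliar with Object Keypoint Similarity (OKS), the sketch below computes the standard COCO-style OKS between one predicted and one ground-truth pose, together with the plain `1 - OKS` loss that is commonly used in regression-based pose estimators. It is only an illustration of the metric the abstract refers to: the smooth OKS-based loss proposed in the paper is not specified here, so it is not reproduced. The function names `oks` and `oks_loss` and the `eps` parameter are hypothetical choices made for this example.

```python
import math

# Per-keypoint sigmas used by the COCO keypoint evaluation (17 body joints).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS, eps=1e-12):
    """Standard OKS for one person.

    pred, gt   : sequences of (x, y) keypoint coordinates
    visibility : COCO visibility flags; keypoints with flag <= 0 are ignored
    area       : ground-truth object area (the scale term s^2)
    """
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma  # COCO uses k_i = 2 * sigma_i
        total += math.exp(-d2 / (2.0 * (area + eps) * k * k))
        labeled += 1
    return total / labeled if labeled else 0.0


def oks_loss(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """Plain (non-smooth) OKS loss; NOT the smooth variant from the paper."""
    return 1.0 - oks(pred, gt, visibility, area, sigmas)


if __name__ == "__main__":
    # Toy two-keypoint pose with illustrative sigmas.
    gt = [(10.0, 10.0), (20.0, 15.0)]
    pred = [(11.0, 10.5), (19.0, 15.0)]
    print(oks(pred, gt, [2, 2], area=400.0, sigmas=[0.05, 0.05]))
```

Because each keypoint term is an exponential of the scale-normalized squared distance, the score is 1 for a perfect prediction and decays toward 0 as keypoints drift, with joints that have larger sigmas tolerating more error.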

Paper Structure (19 sections, 12 equations, 15 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Multi-Person Pose Estimation Methods
Representation Learning for Pose Estimation
Methodology
Architecture Design
Task Misalignment and Keypoint-driven Assignment Metric
Loss Function
Experiments
Experimental Settings
Comparison with State-of-the-Art Methods on COCO
Structural Ablation Study
Dual-Head Assignment Analysis
Keypoint-Driven Assignment Metric Evaluation
Loss Function Ablation Study
...and 4 more sections

Figures (15)

Figure 1: Performance comparison between ER-Pose and representative human pose estimation methods on the COCO test-dev set. (a) Accuracy vs. inference time. (b) Accuracy vs. number of parameters. ER-Pose demonstrates superior inference efficiency while maintaining competitive accuracy.
Figure 2: Illustration of the typical box-driven pipeline: the bounding-box-centered single-stage pose estimation paradigm.
Figure 3: Overview of the ER-Pose framework. ER-Pose adopts a keypoint-driven single-stage architecture. The feature extraction design follows YOLOv8, with CSPDarkNet [wang2020cspnet, bochkovskiy2020yolov4] used as the backbone and PAN [liu2018path] employed as the neck to produce multi-scale features. The head consists of a human confidence branch and a pose branch. The confidence branch adopts DSConv [howard2017mobilenets] to reduce computational cost, while keypoint regression employs standard convolution to preserve sufficient representational capacity. The network is trained under the MAH-SAH scheme.
Figure 4: Suboptimal keypoint feature selection induced by box-driven modeling. The upper part shows positive samples selected by the network, while the lower part presents samples from the same network that yield more accurate keypoint estimates but are suppressed during inference.
Figure 5: The assignment paradigm
...and 10 more figures
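
The Figure 3 caption notes that the confidence branch uses depthwise separable convolution (DSConv) to cut computation while the keypoint branch keeps standard convolution. As a rough illustration of that trade-off, the sketch below counts weights for both layer types. The channel widths and kernel size are hypothetical values chosen for the example, not figures from the paper, and bias terms are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsconv_params(c_in, c_out, k):
    """Weights in a depthwise separable convolution (bias ignored):
    one k x k depthwise filter per input channel, then a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


if __name__ == "__main__":
    # Hypothetical layer shape, for illustration only.
    c_in, c_out, k = 64, 64, 3
    standard = conv_params(c_in, c_out, k)
    separable = dsconv_params(c_in, c_out, k)
    print(standard, separable, standard / separable)
```

The ratio between the two counts is `k*k*c_out / (k*k + c_out)`, so the saving grows with the number of output channels and the kernel size. This is consistent with using DSConv for a lightweight confidence branch and reserving standard convolution for the branch that needs more representational capacity.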
