RSPose: Ranking Based Losses for Human Pose Estimation

Muhammed Can Keles, Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas

TL;DR

RSPose tackles misalignment between training losses and evaluation in heatmap-based human pose estimation by introducing ranking-based Spatial-RS Loss and Instance-Sort Loss. These losses improve the correlation between per-keypoint confidences and localization quality, address heatmap imbalance, and directly optimize mAP-like metrics. The method is validated on 1D and 2D heatmaps across COCO, CrowdPose, and MPII, achieving state-of-the-art AP on COCO-val with ViTPose-H (AP 79.9) and boosting SimCC-ResNet-50 by 1.5 AP to 73.6. The work demonstrates improved robustness and practical benefits for NMS and downstream tasks, marking the first explicit loss designed to align with the mAP objective in pose estimation.

Abstract

While heatmap-based human pose estimation methods have shown strong performance, they suffer from three main problems: (P1) the commonly used Mean Squared Error (MSE) loss may not always improve joint localization because it penalizes all pixel deviations equally, without focusing explicitly on sharpening and correctly localizing the peak corresponding to the joint; (P2) heatmaps are spatially and class-wise imbalanced; and (P3) there is a discrepancy between the evaluation metric (i.e., mAP) and the loss functions. We propose ranking-based losses to address these issues. Both theoretically and empirically, we show that our proposed losses are superior to commonly used heatmap losses (MSE, KL-Divergence). Our losses considerably increase the correlation between confidence scores and localization qualities, which is desirable because higher correlation leads to more accurate instance selection during Non-Maximum Suppression (NMS) and better mean Average Precision (mAP) performance. We refer to the models trained with our losses as RSPose. We show the effectiveness of RSPose in two different modes, one-dimensional and two-dimensional heatmaps, on three different datasets (COCO, CrowdPose, MPII). To the best of our knowledge, we are the first to propose losses that align with the evaluation metric (mAP) for human pose estimation. RSPose outperforms the previous state of the art on the COCO-val set and achieves an mAP score of 79.9 with ViTPose-H, a vision transformer model for human pose estimation. We also improve SimCC ResNet-50, a coordinate classification-based pose estimation method, by 1.5 AP on the COCO-val set, achieving 73.6 AP.
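To make the notion of correlation between confidence scores and localization qualities concrete, the following is a minimal sketch that measures how well the ranking of confidences agrees with the ranking of localization qualities. It uses Spearman rank correlation purely as an illustration; the abstract does not say which correlation measure the paper reports, and the function names and toy values below are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy (made-up) localization qualities for four predictions, e.g. OKS-like scores.
quality = [0.9, 0.5, 0.7, 0.2]
# Confidences that rank the predictions in the same order as their quality.
aligned_conf = [0.8, 0.4, 0.6, 0.1]
# Confidences that rank the predictions in the opposite order.
misaligned_conf = [0.1, 0.6, 0.4, 0.8]

rho_aligned = spearman(quality, aligned_conf)        # 1.0
rho_misaligned = spearman(quality, misaligned_conf)  # -1.0
print(rho_aligned, rho_misaligned)
```

With aligned confidences, score-based selection in NMS keeps the best-localized prediction; with misaligned confidences it would keep the worst one, which is the failure mode a higher correlation is meant to avoid.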

Paper Structure

This paper contains 26 sections, 1 theorem, 21 equations, 3 figures, 8 tables.

Key Result

Theorem 1

Let $L_k \in \mathbb{R}$ denote the localization quality for keypoint $k$, and let $C_k \in \mathbb{R}$ denote the model's confidence score for keypoint $k$. Assume independent keypoints, i.e., $\text{Cov}(L_{\cdot,j}, C_{\cdot,m}) = 0$ for $j \ne m$. If $\sum_{k=1}^K \text{Cov}(L_k, C_k)$ increases

Figures (3)

  • Figure 1: We highlight three problems that arise when training heatmap-based human pose estimation models with MSE. (P1) MSE Loss does not always enhance localization; it can cause models to fit the non-ground-truth (non-GT) pixels better instead of learning optimal localization. (P2) Heatmaps are very sparse and imbalanced, and MSE Loss does not provide balanced gradients for positive and negative pixels. (P3) MSE Loss does not necessarily optimize the ranking alignment between localization qualities and confidences, which is crucial for the evaluation measure mAP. See the section on how we address these problems for a detailed discussion.
  • Figure 2: Imbalance between positive and negative samples in heatmaps, with an input size of 256×192 and a heatmap size of 64×48. Left: One pixel is positive and all other pixels are negative for a joint. Right: Pixels in a region around the positive pixel are assigned a degree of positiveness using a Gaussian function centered at the positive pixel.
  • Figure 3: Cardinality imbalance between positive and negative keypoint samples causes an imbalance between the gradients of positive and negative keypoint samples. Our proposed method (blue line) addresses this issue, providing a perfect balance between the gradients.
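
The positive-negative imbalance described in the Figure 2 caption can be illustrated with a short count over a 64×48 heatmap. This is only a sketch: the Gaussian width (`sigma=2.0`), the 0.5 threshold used to call a soft pixel "positive", and the joint position are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def one_hot_heatmap(height, width, cy, cx):
    """Hard labels: only the joint pixel is positive."""
    return [[1.0 if (y == cy and x == cx) else 0.0 for x in range(width)]
            for y in range(height)]


def gaussian_heatmap(height, width, cy, cx, sigma=2.0):
    """Soft labels: positiveness decays with distance from the joint pixel."""
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def count_pos_neg(heatmap, threshold=0.5):
    """Count pixels at or above the threshold (positives) and the rest (negatives)."""
    total = sum(len(row) for row in heatmap)
    positives = sum(1 for row in heatmap for v in row if v >= threshold)
    return positives, total - positives


HEIGHT, WIDTH = 64, 48   # heatmap size from the Figure 2 caption
CY, CX = 32, 24          # hypothetical joint location

hard_pos, hard_neg = count_pos_neg(one_hot_heatmap(HEIGHT, WIDTH, CY, CX))
soft_pos, soft_neg = count_pos_neg(gaussian_heatmap(HEIGHT, WIDTH, CY, CX))

print(f"one-hot : {hard_pos} positive vs {hard_neg} negative pixels")
print(f"gaussian: {soft_pos} positive vs {soft_neg} negative pixels")
```

Even with Gaussian smoothing under these assumed settings, negatives outnumber positives by more than two orders of magnitude, which is the cardinality imbalance the Figure 3 caption links to imbalanced gradients.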

Theorems & Definitions (2)

  • Theorem 1
  • proof