Semi-supervised 2D Human Pose Estimation via Adaptive Keypoint Masking

Kexin Meng, Ruirui Li, Daguang Jiang

TL;DR

This work tackles label scarcity in 2D human pose estimation by introducing adaptive keypoint masking within a semi-supervised teacher-student framework, coupled with dual-branch Mixup augmentation to enforce smoothness and manifold assumptions. The adaptive masking uses heatmap responses $H_i^j$ to compute $H_i^{max}$, $H_i^{min}$, and relative scores $r_i^j$, allocating masks per sample based on difficulty and treating extremely difficult samples with minimal masking. The method employs two strong augmentation branches and a Mixup-based loss $L_m=\alpha \mathbb{E}\| f(X_m)-H_i\|^2+(1-\alpha) \mathbb{E}\| f(X_m)-H_j\|^2$, all under the unified loss $L_{total}=L_s+\lambda_u L_u+\lambda_m L_m$. Experiments on COCO and MPII show substantial improvements over prior semi-supervised approaches (e.g., +5.2 AP on COCO with 1K labeled data and +0.3 AP on MPII), with ablations confirming the effectiveness of adaptive masking, the Mixup augmentation, and the placement of mixing operations. These contributions reduce labeling needs and improve robustness to pose diversity, advancing practical 2D pose estimation in real-world data regimes.
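
The Mixup loss and the unified objective above can be illustrated with a minimal sketch. It rests on several assumptions not stated in the text: the mixed input is formed as $X_m=\alpha X_i+(1-\alpha)X_j$ (the standard Mixup convention), the expectation is approximated by a mean over heatmap elements, and images and heatmaps are represented as flat lists of floats. The names `mix`, `mse`, `mixup_loss`, and `total_loss` are hypothetical helpers, not the authors' code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mix(x_i: Sequence[float], x_j: Sequence[float], alpha: float) -> Vector:
    """Convex combination of two inputs (assumed form of X_m)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(x_i, x_j)]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as a stand-in for E||.||^2."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mixup_loss(
    f: Callable[[Vector], Vector],
    x_i: Sequence[float],
    x_j: Sequence[float],
    h_i: Sequence[float],
    h_j: Sequence[float],
    alpha: float,
) -> float:
    """L_m = alpha * ||f(X_m) - H_i||^2 + (1 - alpha) * ||f(X_m) - H_j||^2.

    h_i and h_j play the role of the pseudo heatmaps for x_i and x_j.
    """
    pred = f(mix(x_i, x_j, alpha))
    return alpha * mse(pred, h_i) + (1.0 - alpha) * mse(pred, h_j)


def total_loss(
    l_s: float,
    l_u: float,
    l_m: float,
    lambda_u: float = 1.0,  # assumed default weight, not from the paper
    lambda_m: float = 1.0,  # assumed default weight, not from the paper
) -> float:
    """L_total = L_s + lambda_u * L_u + lambda_m * L_m."""
    return l_s + lambda_u * l_u + lambda_m * l_m
```

With `alpha = 1.0` the mixed input reduces to `x_i` and only the first term contributes, which is a quick sanity check on the weighting.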

Abstract

Human pose estimation is a fundamental and challenging task in computer vision. Larger-scale and more accurate keypoint annotations help improve the accuracy of supervised pose estimation, but they are often expensive and difficult to obtain. Semi-supervised pose estimation leverages large amounts of unlabeled data to improve model performance, alleviating the shortage of labeled samples. Recent semi-supervised methods usually adopt a teacher-student framework with weak and strong data augmentation to cope with the diversity of human poses and their long-tailed distribution. An appropriate data augmentation method is one of the key factors affecting the accuracy and generalization of semi-supervised models. Fixed keypoint masking augmentation does not account for differences in how well individual samples are learned; to address this, this paper proposes an adaptive keypoint masking method that mines the information in each sample more fully and achieves better estimation performance. To further improve generalization and robustness, this paper also proposes a dual-branch data augmentation scheme that applies Mixup to samples and features on top of adaptive keypoint masking. The method is evaluated on COCO and MPII, where it outperforms state-of-the-art semi-supervised pose estimation methods by 5.2% and 0.3%, respectively.

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of our semi-supervised 2D human pose estimation method. The green part shows supervised training on labeled data, in which ground-truth labels supervise the predicted heatmaps. The red and blue parts show unsupervised training based on the teacher-student framework. The red part is the teacher: heatmaps estimated from weakly augmented unlabeled data guide the learning of the student. The blue part is the student, for which we propose two novel strong augmentations: Adaptive Keypoint Masking and Mixup. Both strong augmentation branches are supervised by pseudo heatmaps. The supervised part and the unsupervised teacher-student part share the same network parameters.
  • Figure 2: Calculation of the number of adaptive keypoint masks allocated to a sample. The weakly augmented image $i$ is passed through the network to produce $K$ heatmaps, one per keypoint. The maximum value of each heatmap is taken as its response $H_{i}^{j}$, and the relative response $r_{i}^{j}$ of each heatmap is calculated with the formula shown in the figure. Comparing the relative response against a threshold divides the keypoints into two categories: simple and difficult. The number of masks allocated to each sample is calculated from the proportion of its keypoints classified as simple.
  • Figure 3: The effect of adaptive keypoint masking. In this example the human pose is hard to distinguish, so our method assigns the image fewer masks; the network retains more semantic information and obtains better keypoint estimates.
  • Figure 4: The effect of adaptive keypoint masking. The human pose in this image is clearer, so our method assigns it more keypoint masks. The heatmaps show that, with more masks, the network mines deeper information and obtains better estimation results.
  • Figure 5: AP performance and training loss of the Random Masking and Adaptive Masking methods.
  • ...and 1 more figures
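
The mask-allocation procedure described in the Figure 2 caption can be sketched roughly as follows. The exact formula for $r_{i}^{j}$ appears only in the figure, so this sketch assumes a min-max normalization using $H_i^{max}$ and $H_i^{min}$. It also assumes that keypoints with a relative response at or above the threshold count as simple, and that the mask count is the simple-keypoint proportion scaled by a maximum number of masks. The threshold value, `max_masks`, and all function names are hypothetical.

```python
from typing import List, Sequence

Heatmap = Sequence[Sequence[float]]  # one 2D heatmap as nested rows


def peak_responses(heatmaps: Sequence[Heatmap]) -> List[float]:
    """Response H_i^j of each heatmap: its maximum value."""
    return [max(max(row) for row in hm) for hm in heatmaps]


def relative_responses(peaks: Sequence[float]) -> List[float]:
    """ASSUMED formula: r = (H - H_min) / (H_max - H_min)."""
    h_max, h_min = max(peaks), min(peaks)
    span = h_max - h_min
    if span == 0.0:
        # All keypoints respond equally; treat them as equally confident.
        return [1.0 for _ in peaks]
    return [(h - h_min) / span for h in peaks]


def num_masks(
    peaks: Sequence[float],
    threshold: float = 0.5,  # assumed value, not from the paper
    max_masks: int = 5,      # assumed upper bound, not from the paper
) -> int:
    """Allocate masks in proportion to the share of 'simple' keypoints."""
    rel = relative_responses(peaks)
    simple = sum(1 for r in rel if r >= threshold)
    ratio = simple / len(peaks)
    return round(ratio * max_masks)
```

Under these assumptions, a sample whose keypoints mostly respond strongly receives more masks, while a sample with mostly weak responses receives few, which matches the behavior described for Figures 3 and 4.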