Semi-supervised 2D Human Pose Estimation via Adaptive Keypoint Masking
Kexin Meng, Ruirui Li, Daguang Jiang
TL;DR
This work tackles label scarcity in 2D human pose estimation by introducing adaptive keypoint masking within a semi-supervised teacher-student framework, coupled with a dual-branch Mixup augmentation that encourages the smoothness and manifold assumptions. The adaptive masking uses the heatmap responses $H_i^j$ of sample $i$ to compute $H_i^{max}$, $H_i^{min}$, and relative scores $r_i^j$, then allocates the number of masked keypoints per sample according to its difficulty, applying only minimal masking to extremely difficult samples. The method employs two strong augmentation branches and a Mixup-based loss $L_m=\alpha \mathbb{E}\| f(X_m)-H_i\|^2+(1-\alpha) \mathbb{E}\| f(X_m)-H_j\|^2$, all combined in the unified loss $L_{total}=L_s+\lambda_u L_u+\lambda_m L_m$. Experiments on COCO and MPII show improvements over prior semi-supervised approaches (+5.2 AP on COCO with 1K labeled samples and +0.3 on MPII), and ablations confirm the contribution of adaptive masking, of the Mixup augmentation, and of where the mixing operations are placed. These contributions reduce labeling requirements and improve robustness to pose diversity.
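The Mixup loss $L_m$ and the unified objective $L_{total}$ can be illustrated with a minimal sketch in plain Python. It assumes the standard Mixup interpolation $X_m=\alpha X_i+(1-\alpha)X_j$, which the summary does not state explicitly. It also treats images and heatmaps as flat lists of floats and evaluates the loss for a single pair instead of an expectation over a batch. The names `mix_inputs`, `mixup_loss`, and `total_loss`, and the placeholder model `f`, are hypothetical and not taken from the paper.

```python
from typing import Callable, List

Vector = List[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared L2 distance ||a - b||^2 between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mix_inputs(x_i: Vector, x_j: Vector, alpha: float) -> Vector:
    """Assumed standard Mixup: X_m = alpha * X_i + (1 - alpha) * X_j."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(x_i, x_j)]


def mixup_loss(
    f: Callable[[Vector], Vector],
    x_i: Vector,
    x_j: Vector,
    h_i: Vector,
    h_j: Vector,
    alpha: float,
) -> float:
    """Single-pair version of
    L_m = alpha * ||f(X_m) - H_i||^2 + (1 - alpha) * ||f(X_m) - H_j||^2.
    """
    pred = f(mix_inputs(x_i, x_j, alpha))
    return alpha * squared_l2(pred, h_i) + (1.0 - alpha) * squared_l2(pred, h_j)


def total_loss(
    l_s: float, l_u: float, l_m: float, lambda_u: float, lambda_m: float
) -> float:
    """L_total = L_s + lambda_u * L_u + lambda_m * L_m."""
    return l_s + lambda_u * l_u + lambda_m * l_m


if __name__ == "__main__":
    # Hypothetical stand-in for the pose network: the identity function.
    identity = lambda x: list(x)
    l_m = mixup_loss(identity, [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0.5)
    print(l_m, total_loss(1.0, 2.0, l_m, lambda_u=0.5, lambda_m=2.0))
```

In a real training loop, `f` would be the student network, the vectors would be image and heatmap tensors, and the per-pair values would be averaged over the batch to approximate the expectation.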
Abstract
Human pose estimation is a fundamental and challenging task in computer vision. Larger and more accurate sets of keypoint annotations improve the accuracy of supervised pose estimation, but they are often expensive and difficult to obtain. Semi-supervised pose estimation leverages large amounts of unlabeled data to improve model performance, which alleviates the shortage of labeled samples. Recent semi-supervised methods usually adopt a teacher-student framework with strong and weak data augmentation to cope with the diversity of human poses and their long-tailed distribution. An appropriate data augmentation method is one of the key factors affecting the accuracy and generalization of semi-supervised models. Fixed keypoint masking does not account for differences in how well individual samples are learned. To address this, this paper proposes an adaptive keypoint masking method that mines the information in each sample more fully and achieves better estimation performance. To further improve the generalization and robustness of the model, this paper also proposes a dual-branch data augmentation scheme that applies Mixup to both samples and features on top of adaptive keypoint masking. The effectiveness of the proposed method is verified on COCO and MPII, where it outperforms state-of-the-art semi-supervised pose estimation methods by 5.2% and 0.3%, respectively.
