Boosting Semi-Supervised 2D Human Pose Estimation by Revisiting Data Augmentation and Consistency Training
Huayi Zhou, Mukun Luo, Fei Jiang, Yue Ding, Hongtao Lu, Kui Jia
TL;DR
This work tackles label scarcity in 2D human pose estimation by leveraging unlabeled data through semi-supervised learning (SSL). It introduces MultiAugs, a framework that (1) ranks and composes augmentations to maximize their beneficial noise while avoiding over-perturbation, and (2) employs multi-path consistency losses within a single- or dual-network setup to exploit multiple hard augmentations efficiently. The authors demonstrate substantial performance gains on standard benchmarks (COCO, MPII, AIC) and in specialized domains (fisheye, hand pose), with improved training efficiency. The results suggest that carefully designed augmentation synergy and concise consistency training generalize across body, hand, and distortion-varied datasets, offering a practical boost for semi-supervised human pose estimation (SSHPE).
Abstract
2D human pose estimation (HPE) is a fundamental vision problem. However, supervised training requires massive numbers of keypoint labels, which are labor-intensive to collect. We therefore aim to boost a pose estimator by exploiting extra unlabeled data with semi-supervised learning (SSL). Most previous semi-supervised HPE (SSHPE) methods are consistency-based and strive to maintain consistent outputs for differently augmented inputs. Within this family, we find that SSHPE can be boosted along two core directions: advanced data augmentations and concise consistency training schemes. For the first, we discover synergistic effects among existing augmentations and reveal novel paradigms for conveniently producing new, superior HPE-oriented augmentations that add noise to unlabeled samples more effectively. We can therefore establish paired easy-hard augmentations with larger difficulty gaps. For the second, we propose to repeatedly augment unlabeled images with diverse hard augmentations and to generate multi-path predictions sequentially, optimizing multiple losses in a single network. This simple and compact design is interpretable and benefits easily from newly found augmentations. Compared to state-of-the-art SSL approaches, our method brings substantial improvements on public datasets. We also extensively validate the superiority and versatility of our approach on conventional human body images, overhead fisheye images, and human hand images. The code is released at https://github.com/hnuzhy/MultiAugs.
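The multi-path consistency idea described above can be illustrated with a minimal, framework-free sketch. It is not the released implementation. All names here (`multi_path_consistency_loss`, `toy_model`, `mask_first`, `mask_last`) are hypothetical, the "model" is a toy function over flat lists, and mean squared error is assumed as the consistency measure. The sketch also assumes non-geometric hard augmentations, so the easy and hard predictions can be compared directly without aligning heatmaps.

```python
from typing import Callable, List, Sequence

Image = List[float]    # toy stand-in for an image (flat list of pixel values)
Heatmap = List[float]  # toy stand-in for a keypoint heatmap


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat heatmaps."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_path_consistency_loss(
    model: Callable[[Image], Heatmap],
    image: Image,
    easy_aug: Callable[[Image], Image],
    hard_augs: Sequence[Callable[[Image], Image]],
) -> float:
    """Sum the consistency losses between one easy path and several hard paths."""
    # Easy path: its prediction is treated as a fixed pseudo-target.
    # (In a real framework, no gradient would flow through this path.)
    target = model(easy_aug(image))
    total = 0.0
    # Hard paths: the same network is applied sequentially to each
    # hard-augmented view, and one loss term is added per path.
    for aug in hard_augs:
        prediction = model(aug(image))
        total += mse(prediction, target)
    return total


# Toy components, purely illustrative (assumptions, not from the paper).
def toy_model(img: Image) -> Heatmap:
    return [2.0 * v for v in img]


def identity(img: Image) -> Image:
    return list(img)


def mask_first(img: Image) -> Image:
    return [0.0] + list(img[1:])


def mask_last(img: Image) -> Image:
    return list(img[:-1]) + [0.0]


image = [1.0, 2.0, 3.0, 4.0]
loss = multi_path_consistency_loss(
    toy_model, image, identity, [mask_first, mask_last]
)
print(loss)
```

Adding a newly found hard augmentation in this sketch only means appending another callable to `hard_augs`, which mirrors the claim that the design benefits easily from new augmentations.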
