Boosting Semi-Supervised 2D Human Pose Estimation by Revisiting Data Augmentation and Consistency Training

Huayi Zhou, Mukun Luo, Fei Jiang, Yue Ding, Hongtao Lu, Kui Jia

TL;DR

This work tackles label scarcity in 2D human pose estimation (HPE) by leveraging unlabeled data through semi-supervised learning (SSL). It introduces MultiAugs, a framework that (1) ranks and composes augmentations to maximize their beneficial noise while avoiding over-perturbation, and (2) applies multi-path consistency losses within a single-network or dual-network setup to exploit multiple hard augmentations efficiently. The authors report substantial performance gains on standard benchmarks (COCO, MPII, AIC) and in specialized domains (fisheye, hand pose), with improved training efficiency. The results suggest that carefully designed augmentation synergy and concise consistency training generalize across body, hand, and distortion-varied datasets, offering a practical boost for semi-supervised HPE (SSHPE).

Abstract

2D human pose estimation (HPE) is a basic visual problem. However, supervised learning for it requires massive numbers of keypoint labels, which are labor-intensive to collect. We therefore aim to boost a pose estimator by exploiting extra unlabeled data with semi-supervised learning (SSL). Most previous semi-supervised HPE (SSHPE) methods are consistency-based and strive to maintain consistent outputs for differently augmented inputs. Within this family, we find that SSHPE can be boosted through two core components: advanced data augmentations and concise consistency training. For the first, we discover synergistic effects among existing augmentations and reveal novel paradigms for conveniently producing new, superior HPE-oriented augmentations that add noise to unlabeled samples more effectively. We can therefore establish paired easy-hard augmentations with larger difficulty gaps. For the second, we propose to repeatedly augment unlabeled images with diverse hard augmentations and to generate multi-path predictions sequentially, optimizing multiple losses in a single network. This simple and compact design is interpretable and easily benefits from newly found augmentations. Compared to state-of-the-art SSL approaches, our method brings substantial improvements on public datasets. We extensively validate the superiority and versatility of our approach on conventional human body images, overhead fisheye images, and human hand images. The code is released at https://github.com/hnuzhy/MultiAugs.
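
The multi-path idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mse`, `multi_path_consistency_loss`, `brighten`, `mask_top_left`), the toy list-of-lists heatmaps, the identity "model", and the choice to sum the per-path losses are all assumptions made for illustration. A real pipeline would also map predictions back through any geometric transform before comparing them, and would stop gradients through the easy-path target; both steps are omitted here.

```python
from typing import Callable, List, Sequence

Heatmap = List[List[float]]          # toy stand-in for one keypoint heatmap
Transform = Callable[[Heatmap], Heatmap]


def mse(a: Heatmap, b: Heatmap) -> float:
    """Mean squared error between two equally sized heatmaps."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def multi_path_consistency_loss(
    image: Heatmap,
    model: Transform,
    easy_aug: Transform,
    hard_augs: Sequence[Transform],
) -> float:
    """Sum of consistency losses over several hard-augmented paths.

    Assumption: per-path losses are summed; averaging is an equally
    plausible choice for this sketch.
    """
    # Easy path: its prediction serves as the fixed pseudo-target.
    target = model(easy_aug(image))
    # Hard paths: the same network is applied sequentially, once per augmentation.
    return sum(mse(model(aug(image)), target) for aug in hard_augs)


# Hypothetical toy augmentations (photometric only, so no re-alignment is needed).
def identity(img: Heatmap) -> Heatmap:
    return [row[:] for row in img]


def brighten(img: Heatmap) -> Heatmap:
    return [[v + 1.0 for v in row] for row in img]


def mask_top_left(img: Heatmap) -> Heatmap:
    out = [row[:] for row in img]
    out[0][0] = 0.0                  # crude cutout-style perturbation
    return out


if __name__ == "__main__":
    image = [[1.0, 2.0], [3.0, 4.0]]
    loss = multi_path_consistency_loss(image, identity, identity,
                                       [brighten, mask_top_left])
    print(loss)
```

Because every hard path reuses the same network and the same easy-path target, adding a newly found augmentation only means appending one more entry to `hard_augs`.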

Paper Structure (31 sections, 7 equations, 14 figures, 15 tables)

Figures (14)

  • Figure 1: Frameworks of existing semi-supervised human pose estimation (SSHPE) methods, including (a) Single-Network and (b) Dual-Network, both originally proposed by xie2021empirical, and (c) Triple-Network, proposed by huang2023semi.
  • Figure 2: Comparison of different easy-hard augmentation pairs for training a Single-Network model as in Fig. \ref{OldA}. These six augmentations can be ranked unambiguously by either their best mAP results or their distinct convergence curves.
  • Figure 3: Illustrations of the superior combinations $T_{JOCO}$ and $T_{JCCM}$. Each is a sequence of ready-made collaborative augmentations. $T_{JO}$ and $T_{CM}$ need extra patches cropped from other images, which are not displayed.
  • Figure 4: Best mAPs of different combinations.
  • Figure 5: The corresponding convergence curves of combinations in Tab. \ref{tabAugsPlusAS}.
  • ...and 9 more figures