A New Teacher-Reviewer-Student Framework for Semi-supervised 2D Human Pose Estimation

Wulian Yun, Mengshi Qi, Fei Peng, Huadong Ma

TL;DR

This work tackles the labeling bottleneck in 2D human pose estimation by introducing a semi-supervised framework that blends a teacher, reviewer, and student. The Teacher-Reviewer-Student architecture leverages unlabeled data through teacher guidance and historical parameter information stored by reviewer networks, while Multi-level Feature Learning and Keypoint-Mix augment supervision and discrimination of keypoints. Training combines supervised losses on labeled data with two consistency-based unsupervised losses on unlabeled data, and reviewer parameters are maintained via EMA to retain training history. Empirical results on COCO, MPII, and AI Challenger demonstrate state-of-the-art improvements under limited labels, with ablations confirming the effectiveness of the proposed components and qualitative analyses verifying improved localization and robustness.
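The EMA update mentioned above can be sketched in a few lines. This is a minimal illustration of the standard exponential-moving-average rule, not the authors' implementation: the function name `ema_update`, the dictionary-of-lists parameter layout, and the decay value are all assumptions made for this example.

```python
def ema_update(reviewer, source, decay=0.999):
    """Blend `source` parameters into `reviewer` in place (standard EMA rule).

    Both arguments are hypothetical stand-ins for network weights:
    dicts mapping a parameter name to a flat list of floats.
    The default decay is an assumed value, not one taken from the paper.
    """
    for name, src_values in source.items():
        reviewer[name] = [
            decay * r + (1.0 - decay) * s
            for r, s in zip(reviewer[name], src_values)
        ]
    return reviewer


# Usage: the reviewer drifts slowly toward the current network weights,
# so it keeps a smoothed record of earlier training states.
reviewer = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}
ema_update(reviewer, student, decay=0.9)
```

A decay close to 1 makes the reviewer change slowly, which is what lets it carry information from earlier training steps rather than mirroring the latest gradient update.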

Abstract

Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate these problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method built on a newly designed Teacher-Reviewer-Student framework. Specifically, we first design our framework by mimicking the way human beings constantly review previous knowledge to consolidate it: the teacher predicts results to guide the student's learning, and the reviewer stores important historical parameters to provide additional supervision signals. Second, we introduce a Multi-level Feature Learning strategy, which uses the outputs from different stages of the backbone to estimate the heatmap that guides network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, Keypoint-Mix, which perturbs pose information by mixing different keypoints, thus enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets demonstrate that our method achieves significant improvements over existing methods.
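As a rough illustration of supervising several backbone stages at once, the sketch below upsamples coarse per-stage heatmaps to the target resolution and averages their errors. It rests on several assumptions that the abstract does not state: each stage is taken to already produce a square single-keypoint heatmap, upsampling is nearest-neighbor, the error is mean squared error, and all levels are weighted equally. The function names are hypothetical.

```python
def upsample_nearest(heatmap, factor):
    """Repeat every cell `factor` times along both axes (nearest-neighbor)."""
    return [
        [value for value in row for _ in range(factor)]
        for row in heatmap
        for _ in range(factor)
    ]


def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    count = len(a) * len(a[0])
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / count


def multi_level_loss(stage_heatmaps, target):
    """Average the MSE of every upsampled stage heatmap against `target`.

    Assumes square heatmaps whose side length divides the target side length.
    """
    size = len(target)
    losses = []
    for heatmap in stage_heatmaps:
        factor = size // len(heatmap)
        losses.append(mse(upsample_nearest(heatmap, factor), target))
    return sum(losses) / len(losses)
```

The point of the sketch is only that each stage receives its own supervisory signal at the target resolution; the actual heads, upsampling operator, and loss weighting in the paper may differ.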

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of our proposed Teacher-Reviewer-Student framework for the semi-supervised 2D HPE task. Unlike fully supervised methods, which rely solely on labeled data for pose estimation, our semi-supervised method uses both labeled and unlabeled data to estimate human pose. Furthermore, we add a reviewer network to the teacher-student framework to provide additional supervisory signals.
  • Figure 2: Overview of our framework. Our method comprises network $\mathcal{G}$, network $\mathcal{F}$, and reviewer networks $\mathcal{R}_1$ and $\mathcal{R}_2$. Networks $\mathcal{G}$ and $\mathcal{F}$ take turns playing the roles of teacher and student. The teacher network generates predictions for unlabeled data to guide the training of the student network. The reviewer networks retain crucial information from networks $\mathcal{G}$ and $\mathcal{F}$ during training and provide additional supervision; their parameters are updated from networks $\mathcal{G}$ and $\mathcal{F}$ via EMA. Multi-level Feature Learning upsamples the outputs of multiple backbone stages to estimate the heatmap. Keypoint-Mix is a data augmentation strategy. ${\mathcal{M}_{e \rightarrow h}}$ denotes the mapping of the predictions for easy-augmented and hard-augmented data into the same coordinate space.
  • Figure 3: Illustration of the Keypoint-Mix data augmentation strategy. Unlabeled data is fed into the teacher network to generate keypoint predictions. A random subset of these keypoints is sampled, and image patches are extracted from the regions surrounding them. These patches are then mixed into a single blended patch, which is pasted back over the original regions.
  • Figure 4: Qualitative comparison of our method with other semi-supervised 2D HPE methods, Dual [9710942] and SSPCM [Huang_2023_CVPR], on the COCO VAL dataset, where all models are trained with 1K labeled data using ResNet18 as the backbone. The first and second rows show single-person scenarios, the third row shows a multi-person scenario, and the fourth row shows an occlusion scenario.
  • Figure 5: Heatmap visualization of two samples from the COCO dataset. From left to right, the columns show the ground truth (GT), the heatmaps estimated by our method without Multi-level Feature Learning (w/o MFL), and the heatmaps estimated by our full method.
  • ...and 1 more figure
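The Keypoint-Mix procedure described in the Figure 3 caption (sample predicted keypoints, extract surrounding patches, blend them, paste the blend back) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name `keypoint_mix`, the grayscale nested-list image, the square patch size, the number of sampled keypoints, and the use of a plain average as the mixing rule are all assumptions.

```python
import random


def keypoint_mix(image, keypoints, num_mix=2, half=1, rng=None):
    """Return a copy of `image` with blended patches at sampled keypoints.

    image:     2D list of floats (H x W grayscale), a stand-in for a real image.
    keypoints: list of (row, col) predictions, e.g. from the teacher network.
    num_mix:   how many keypoints to sample (assumed value).
    half:      patch half-size, so patches are (2*half+1) squares (assumed).
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Keep only keypoints whose full patch lies inside the image.
    valid = [
        (r, c) for r, c in keypoints
        if half <= r < height - half and half <= c < width - half
    ]
    out = [row[:] for row in image]
    if len(valid) < 2:
        return out  # nothing to mix
    chosen = rng.sample(valid, min(num_mix, len(valid)))
    size = 2 * half + 1
    # Blend: element-wise average of the patches around the chosen keypoints.
    blended = [
        [
            sum(image[r - half + i][c - half + j] for r, c in chosen) / len(chosen)
            for j in range(size)
        ]
        for i in range(size)
    ]
    # Paste the blended patch back over each original region.
    for r, c in chosen:
        for i in range(size):
            for j in range(size):
                out[r - half + i][c - half + j] = blended[i][j]
    return out
```

Because every sampled region ends up showing the same blended content, local appearance no longer identifies a keypoint on its own, which is one plausible way such an augmentation could push the network toward better keypoint discrimination.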