Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach

Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

The paper addresses inconsistent skeleton annotations across pose datasets by training a student model, guided by one pretrained teacher per dataset, to predict a unified 21-keypoint skeleton. It combines pose union learning with multi-teacher distillation, trains on the merged COCO and MPII data, and compares against single-dataset baselines, with Halpe used to evaluate the extended keypoints. The two jointly trained models reach an average accuracy of 70.89 and 76.40 when evaluated on both datasets, versus 53.79 and 55.78 for models trained on a single dataset. They also predict joints absent from each dataset's own ground truth, and all 21 predicted points score an AP of 66.84 and 72.75 on Halpe. The approach could support annotation enrichment and active-learning-driven dataset enhancement.

Abstract

Human pose estimation is a key task in computer vision with applications such as activity recognition and interactive systems. However, annotated skeletons are inconsistent across datasets, which makes it hard to develop universally applicable models. To address this challenge, we propose a novel approach integrating multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, which contain 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 more than COCO and 5 more than MPII annotate, improving cross-dataset generalization. Our joint models achieved an average accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. We also evaluate all 21 points predicted by our two models on the Halpe dataset, reporting an AP of 66.84 and 72.75. This highlights the potential of our technique to address one of the most pressing challenges in pose estimation research and application: the inconsistency in skeletal annotations.
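
The keypoint counts in the abstract can be checked with a small sketch that builds the union of the two skeletons. The joint names follow the commonly used COCO and MPII conventions, with MPII's upper neck written as `neck`. The ordering of the unified skeleton and the helper names `build_union` and `annotation_mask` are assumptions made for illustration, not the authors' implementation.

```python
# Commonly used COCO ordering (17 keypoints).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Commonly used MPII ordering (16 keypoints); "neck" stands for MPII's upper neck.
MPII_KEYPOINTS = [
    "right_ankle", "right_knee", "right_hip", "left_hip", "left_knee",
    "left_ankle", "pelvis", "thorax", "neck", "head_top",
    "right_wrist", "right_elbow", "right_shoulder",
    "left_shoulder", "left_elbow", "left_wrist",
]


def build_union(*skeletons):
    """Return the order-preserving union of several keypoint lists."""
    union = []
    for skeleton in skeletons:
        for name in skeleton:
            if name not in union:
                union.append(name)
    return union


def annotation_mask(skeleton, unified):
    """Mark which unified joints a dataset actually annotates."""
    return [name in skeleton for name in unified]


# Hypothetical unified ordering: COCO joints first, then the MPII-only joints.
UNIFIED = build_union(COCO_KEYPOINTS, MPII_KEYPOINTS)
COCO_MASK = annotation_mask(COCO_KEYPOINTS, UNIFIED)
MPII_MASK = annotation_mask(MPII_KEYPOINTS, UNIFIED)

# Joints the unified skeleton adds on top of each dataset's annotations.
NEW_FOR_COCO = [n for n, m in zip(UNIFIED, COCO_MASK) if not m]
NEW_FOR_MPII = [n for n, m in zip(UNIFIED, MPII_MASK) if not m]

print(len(UNIFIED), NEW_FOR_COCO, NEW_FOR_MPII)
```

The two skeletons share 12 joints, so the union has 17 + 16 - 12 = 21 keypoints: 4 beyond COCO and 5 beyond MPII.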

Paper Structure (15 sections, 6 equations, 2 figures, 5 tables)


Figures (2)

  • Figure 1: Illustration of our proposed learning approach, showing how information from multiple datasets is integrated to create the unified skeleton. Our student learns from both datasets and one pretrained teacher for each dataset using a combination of distillation losses and conditional keypoints loss.
  • Figure 2: Qualitative Results: Predicted skeletons on different images show the accuracy of our pose estimation model across different scenarios. Notably, our model adds five new points (nose, eyes, ears) to the MPII images (top) and four new points (pelvis, thorax, neck, head top) to the COCO images (bottom). These points are highlighted in red ($\textcolor{red}{\bullet}$) in the figure.
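
The Figure 1 caption describes a student trained with a conditional keypoints loss plus distillation losses from one teacher per dataset. The sketch below shows one way such a combination could look. It is a simplified illustration under stated assumptions: it works on 2D coordinates rather than whatever representation the paper uses, and the function names, the `distill_weight` parameter, and the three-joint toy skeleton are all hypothetical. The paper's own equations are not reproduced here.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the joints where mask is True (0.0 if none)."""
    total, count = 0.0, 0
    for (px, py), (tx, ty), keep in zip(pred, target, mask):
        if keep:
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count if count else 0.0


def student_loss(student, ground_truth, gt_mask,
                 teacher_preds, teacher_masks, distill_weight=0.5):
    """Supervised loss on annotated joints plus one distillation term per teacher.

    Assumption: each teacher only supervises the joints of its own dataset's
    skeleton, which is how the student can learn joints that are missing from
    the ground truth of the current sample.
    """
    loss = masked_mse(student, ground_truth, gt_mask)  # conditional keypoints loss
    for t_pred, t_mask in zip(teacher_preds, teacher_masks):
        loss += distill_weight * masked_mse(student, t_pred, t_mask)
    return loss


# Toy unified skeleton: [nose, left_shoulder, pelvis].
# The sample comes from an MPII-style dataset, so "nose" has no ground truth.
student = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
ground_truth = [(0.0, 0.0), (2.0, 2.0), (3.0, 4.0)]
gt_mask = [False, True, True]

coco_teacher = [(1.0, 2.0), (2.0, 2.0), (0.0, 0.0)]  # predicts nose, shoulder
mpii_teacher = [(0.0, 0.0), (2.0, 2.0), (3.0, 3.0)]  # predicts shoulder, pelvis
teacher_masks = [[True, True, False], [False, True, True]]

loss = student_loss(student, ground_truth, gt_mask,
                    [coco_teacher, mpii_teacher], teacher_masks)
print(loss)  # 0.75
```

In this toy case the nose receives a training signal only from the COCO-style teacher, which mirrors how the model in Figure 2 can add nose, eyes, and ears to MPII images.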