Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach
Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal
TL;DR
The paper addresses inconsistent skeleton annotations across pose datasets by training a student model, guided by multiple dataset-specific teachers, to predict a unified 21-keypoint skeleton. It combines pose union learning with multi-teacher distillation, trains on a merged COCO+MPII corpus, and evaluates against single-dataset baselines, using Halpe to assess the extended keypoints. The two joint models reach average accuracies of 70.89 and 76.40 across both datasets, compared with 53.79 and 55.78 for models trained on a single dataset. They also predict joints that are absent from each dataset's own ground truth, reaching an AP of 66.84 and 72.75 on Halpe over all 21 predicted keypoints. The approach could support annotation enrichment and active-learning-driven dataset enhancement.
Abstract
Human pose estimation is a key task in computer vision, with applications such as activity recognition and interactive systems. However, annotated skeletons are inconsistent across datasets, which makes it difficult to develop universally applicable models. To address this challenge, we propose a novel approach that integrates multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, which contain 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 more than the COCO annotations and 5 more than the MPII annotations, which improves cross-dataset generalization. Our joint models achieve an average accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. We also evaluate all 21 keypoints predicted by our two models on the Halpe dataset, where they achieve an AP of 66.84 and 72.75. This highlights the potential of our technique to address one of the most pressing challenges in pose estimation research and application: the inconsistency in skeletal annotations.
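The keypoint counts in the abstract can be checked with a small sketch. It takes the union of the joint names in the public COCO and MPII annotation formats and confirms that it has 21 entries. The ordering of the unified skeleton, the helper names `build_union_skeleton` and `supervision_mask`, and the idea of a per-dataset mask are illustrative assumptions. The paper's own joint ordering and training code may differ.

```python
# Joint names follow the public COCO (17) and MPII (16) annotation formats.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
MPII_KEYPOINTS = [
    "right_ankle", "right_knee", "right_hip", "left_hip", "left_knee",
    "left_ankle", "pelvis", "thorax", "upper_neck", "head_top",
    "right_wrist", "right_elbow", "right_shoulder", "left_shoulder",
    "left_elbow", "left_wrist",
]


def build_union_skeleton(*skeletons):
    """Merge keypoint lists, keeping first-seen order and dropping duplicates.

    Hypothetical helper: the resulting order is an assumption, not the
    ordering used by the authors.
    """
    union = []
    for skeleton in skeletons:
        for name in skeleton:
            if name not in union:
                union.append(name)
    return union


def supervision_mask(source, unified):
    """Return one boolean per unified joint: True if `source` annotates it.

    Hypothetical helper: joints marked False have no ground truth in that
    dataset, so a training signal for them must come from elsewhere
    (in the paper's setting, from a teacher trained on the other dataset).
    """
    annotated = set(source)
    return [name in annotated for name in unified]


UNIFIED = build_union_skeleton(COCO_KEYPOINTS, MPII_KEYPOINTS)
COCO_MASK = supervision_mask(COCO_KEYPOINTS, UNIFIED)
MPII_MASK = supervision_mask(MPII_KEYPOINTS, UNIFIED)

print(len(UNIFIED))                         # size of the unified skeleton
print(len(UNIFIED) - len(COCO_KEYPOINTS))   # joints added relative to COCO
print(len(UNIFIED) - len(MPII_KEYPOINTS))   # joints added relative to MPII
```

The two datasets share 12 limb joints (shoulders, elbows, wrists, hips, knees, ankles), so the union has 17 + 16 - 12 = 21 entries. COCO contributes five face points that MPII lacks, and MPII contributes pelvis, thorax, upper neck, and head top.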
