Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation

Sarosij Bose, Hannah Dela Cruz, Arindam Dutta, Elena Kokkoni, Konstantinos Karydis, Amit K. Roy-Chowdhury

TL;DR

SHIFT addresses the problem of infant pose estimation under limited labeled data by transferring knowledge from synthetic adult data using an unsupervised domain-adaptation framework. It combines a mean-teacher consistency mechanism, an offline infant manifold pose prior, and a context-aware pose-image alignment module (Kp2Seg) to enforce anatomical plausibility and visual coherence under self-occlusion. The approach provides the first UDA solution for infant pose estimation and demonstrates substantial performance gains over previous UDA methods and even some supervised baselines, highlighting its data-efficient and privacy-friendly potential for neuromotor assessment, safety monitoring, and assistive robotics. Key contributions include the offline infant pose prior implementation via PoseNDF, the Kp2Seg mapping for pose-to-segmentation guidance, and extensive ablations validating the necessity of each component.

Abstract

Human pose estimation is a critical tool across a variety of healthcare applications. Despite significant progress in pose estimation algorithms targeting adults, such developments for infants remain limited. Existing algorithms for infant pose estimation, despite achieving commendable performance, depend on fully supervised approaches that require large amounts of labeled data. These algorithms also struggle with poor generalizability under distribution shifts. To address these challenges, we introduce SHIFT: Leveraging SyntHetic Adult Datasets for Unsupervised InFanT Pose Estimation, which leverages the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and addresses distribution shifts by enforcing consistency between the student and the teacher pseudo-labels. Additionally, to penalize implausible predictions obtained from the mean-teacher framework, we incorporate an infant manifold pose prior. To enhance SHIFT's self-occlusion perception ability, we propose a novel visibility consistency module for improved alignment of the predicted poses with the original image. Extensive experiments on multiple benchmarks show that SHIFT significantly outperforms existing state-of-the-art unsupervised domain adaptation (UDA) pose estimation methods by 5% and supervised infant pose estimation methods by a margin of 16%. The project page is available at: https://sarosijbose.github.io/SHIFT.
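The student-teacher consistency described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes heatmap-based keypoint predictions, and the confidence threshold `conf_thresh`, the mean-squared-error form, and the function name `consistency_loss` are hypothetical choices made only for illustration. In a full mean-teacher setup the two networks would also see differently augmented views of the same target image, which is omitted here.

```python
import numpy as np

def consistency_loss(student_hm, teacher_hm, conf_thresh=0.5):
    """Illustrative consistency loss between student and teacher heatmaps.

    student_hm, teacher_hm: arrays of shape (K, H, W), one heatmap per keypoint.
    Keypoints whose teacher peak is below conf_thresh (an assumed value)
    are ignored, so low-confidence pseudo-labels do not supervise the student.
    """
    num_kp = teacher_hm.shape[0]
    # Peak value of each teacher heatmap, used as a confidence score.
    conf = teacher_hm.reshape(num_kp, -1).max(axis=1)
    mask = conf >= conf_thresh
    if not mask.any():
        return 0.0
    # Mean squared error per keypoint, averaged over confident keypoints only.
    per_kp = ((student_hm - teacher_hm) ** 2).mean(axis=(1, 2))
    return float(per_kp[mask].mean())

# Toy example: two keypoints on a 4x4 grid.
teacher_hm = np.zeros((2, 4, 4))
teacher_hm[0, 1, 2] = 1.0   # confident pseudo-label
teacher_hm[1, 3, 3] = 0.2   # low confidence, filtered out
student_hm = np.zeros((2, 4, 4))
loss = consistency_loss(student_hm, teacher_hm)
```

Filtering by teacher confidence is one common way to keep noisy pseudo-labels from dominating the loss; the abstract does not say whether SHIFT does this.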

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Need for unsupervised domain adaptive infant pose estimation. From left to right: keypoint predictions from a baseline adult human pose estimation model kim2022unified, predictions from a SOTA UDA pose estimation model huang2021invariant, and predictions from our method, SHIFT. Adult pose estimation models fail when directly applied to infant data; similarly, UniFrame kim2022unified struggles to overcome the domain shift between adults and infants. In contrast, SHIFT accounts for the highly self-occluded pose distribution of infants and thereby adapts effectively to the infant domain.
  • Figure 2: Framework Overview. SHIFT utilizes the Mean-Teacher framework tarvainen2017mean, updating the teacher model $\mathcal{M}_t$ with an Exponential Moving Average (EMA) of the student model $\mathcal{M}_s$'s weights, to adapt a model pre-trained on a labeled adult source dataset $(x_s, y_s)$ to unlabeled infant target images $(x_t)$ (\ref{subsec:mean-teacher}). To address anatomical variations in infants, SHIFT employs an infant pose prior $\theta_p$, which assigns a plausibility score to each prediction of the student model $\mathcal{M}_s$ (\ref{subsec:infant-prior}). Further, to handle the large self-occlusions in the target domain, we employ an off-the-shelf model $F_{seg}$ to produce pseudo segmentation masks $p_t$, with which our Kp2Seg module $\mathcal{G}(\cdot)$ learns to perform pose-image visibility alignment (\ref{subsec:Kp2Seg}), thereby leveraging the context present in the visible portions of each image. Learnable components of the framework are shown in red and the rest in black.
  • Figure 3: Qualitative results on SURREAL $\rightarrow$ SyRIP (top 3 rows) and SURREAL $\rightarrow$ MINI-RGBD (bottom 2 rows). From left to right: source-only keypoints, keypoint predictions by UniFrame, predictions by FiDIP, predictions by SHIFT, and ground truth keypoints. The infant prior is essential for predicting plausible poses in cases where other methods fail (top row). Further, our method can use context from visible regions to predict keypoints in self-occluded areas (2nd and 3rd rows) while adapting seamlessly to different scenarios (4th and 5th rows). $\bigcirc$ denotes the self-occluded regions in the images.
  • Figure 4: Tackling Self-Occlusions: SURREAL $\rightarrow$ SyRIP. UniFrame prediction (left panel) fails to correctly estimate significant portions of the lower back and left hand of the infant while SHIFT is able to reasonably do so. Ground truth (rightmost panel) and extracted mask (second from left panel) are also shown.
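
The EMA teacher update mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration rather than the authors' code: weights are represented as a plain dictionary of NumPy arrays, and the smoothing factor `alpha` and the function name `ema_update` are assumptions introduced here.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.999):
    """Return teacher weights moved toward the student by an EMA step.

    teacher, student: dicts mapping parameter names to NumPy arrays of
    matching shapes. alpha is an assumed smoothing factor; values close
    to 1 make the teacher change slowly.
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }

# Toy example with a single 3-element parameter.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, alpha=0.9)
```

Because the teacher is an average of past student weights rather than a separately trained network, only the student receives gradient updates in this scheme.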