Comparison of marker-less 2D image-based methods for infant pose estimation

Lennart Jahn, Sarah Flügge, Dajie Zhang, Luise Poustka, Sven Bölte, Florentin Wörgötter, Peter B Marschik, Tomas Kulvicius

TL;DR

This study addresses automated infant general movement assessment (GMA) by systematically comparing four generic pose estimators and two infant-specific models on a multi-view infant dataset. ViTPose, the top-performing generic model on COCO, also leads on infant data, and retraining ViTPose on the study's infant dataset markedly improves pose accuracy, especially for the hard-to-detect hip keypoints. A top-down view yields better pose estimation than the conventional diagonal view, which points to occlusion as a limitation of diagonal setups. Infant-specific estimators generalize poorly to new data, and the retrained ViTPose performs best overall. The findings support using a top-down recording setup and retraining on the target dataset; if retraining is not feasible, a strong generic estimator is preferable to specialized infant models trained on different data.

Abstract

In this study we compare the performance of available generic and infant-specific pose estimators for video-based automated general movement assessment (GMA), and we compare two viewing angles for recording: the conventional diagonal view used in GMA and a top-down view. We used 4500 annotated video frames from 75 recordings of infant spontaneous motor functions from 4 to 26 weeks. To determine which pose estimation method and camera angle yield the best pose estimation accuracy on infants in a GMA-related setting, we computed and compared the distance to human annotations and the percentage of correct keypoints (PCK). The results show that ViTPose, the best-performing generic model trained on adults, also performs best on infants. We see no improvement from using infant-pose estimators over the generic pose estimators on our infant dataset. However, retraining a generic model on our data significantly improves pose estimation accuracy. The pose estimation accuracy obtained from the top-down view is significantly better than that obtained from the diagonal view, especially for the detection of the hip keypoints. The results also indicate that infant-pose estimators generalize poorly to other infant datasets, so they should be applied with care to datasets they were not trained on. Although the standard GMA method uses a diagonal view for assessment, the accuracy gain from the top-down view suggests that a top-down view should be included in recording setups for automated GMA research.
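The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes keypoints are given as (x, y) pixel coordinates and that PCK counts a keypoint as correct when its distance to the annotation is at most a fixed pixel threshold. The threshold value, the function names, and the sample coordinates are hypothetical; the abstract does not state how the study sets or normalizes the threshold.

```python
import math

def keypoint_distances(pred, truth):
    """Euclidean pixel distance between each predicted and annotated keypoint."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]

def pck(pred, truth, threshold):
    """Fraction of keypoints whose distance to the annotation is <= threshold."""
    dists = keypoint_distances(pred, truth)
    return sum(d <= threshold for d in dists) / len(dists)

# Hypothetical single frame with three keypoints (x, y) in pixels.
annotated = [(100.0, 100.0), (150.0, 100.0), (125.0, 200.0)]
predicted = [(103.0, 104.0), (150.0, 100.0), (125.0, 230.0)]

dists = keypoint_distances(predicted, annotated)
mean_dist = sum(dists) / len(dists)

# Assumed threshold of 10 pixels, chosen only for illustration.
score = pck(predicted, annotated, threshold=10.0)

print(dists, mean_dist, score)
```

In this toy frame the third keypoint lies 30 pixels from its annotation, so two of the three keypoints count as correct under the assumed threshold. In practice the threshold is often scaled by a body-size reference so that results are comparable across subjects and camera distances.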

Paper Structure (41 sections, 2 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of the recording setup a) and its output b) and c). The cameras recording the infants are circled in red. For this study, only the two cameras labeled diagonal / top view were used. Panels b) and c) show example frames for infants of different age and pose complexity from the two different views. The extracted pose keypoints are displayed as skeletons over the image. Note that neither the human annotators nor any of the pose estimators could reliably determine the position of the fully covered ear in the rightmost example.
  • Figure 2: Difference between two annotators, additionally split by viewing angle. Error bars represent 95% confidence intervals of the mean.
  • Figure 3: Difference to annotation $d_a$ in pixels for different subjects, evaluated on our dataset and grouped by keypoint. Error bars represent 95% confidence intervals of the mean.
  • Figure 4: Pose estimation errors for the generic pose estimation models, split by viewing angle. Error bars represent 95% confidence intervals of the mean.
  • Figure 5: Mean difference from annotation ($d_a$) for retrained ViTPose models, evaluated separately on diagonal-view and top-view images. The models were trained on all images, only diagonal-view images, or only top-view images, respectively. Error bars represent 95% confidence intervals of the mean.
  • ...and 1 more figures