HUP-3D: A 3D multi-view synthetic dataset for assisted-egocentric hand-ultrasound pose estimation
Manuel Birlo, Razvan Caramalau, Philip J. "Eddie" Edwards, Brian Dromey, Matthew J. Clarkson, Danail Stoyanov
TL;DR
HUP-3D addresses the lack of large-scale labeled data for egocentric hand–probe pose estimation in obstetric ultrasound by introducing a scalable, synthetic multi-view dataset. The authors propose a two-stage grasp generation and Blender-based rendering pipeline, combined with a sphere camera concept that captures both egocentric and non-egocentric viewpoints, to produce RGB, depth, and segmentation maps with ground-truth 3D poses. Evaluation with HOPE-net on 31,680 frames across 11 grasps yields a state-of-the-art mean per-joint position error (MPJPE) of 8.65 mm (hand 5.33 mm, object 17.05 mm), lower than the errors reported on prior clinical datasets and lower than a ResNet baseline. The dataset supports training of egocentric hand–probe pose estimators, with practical implications for mixed-reality medical education and standardized ultrasound guidance. Future plans include real-image augmentation and temporal modeling.
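The reported errors are MPJPE values, that is, the mean Euclidean distance between predicted and ground-truth 3D keypoints. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `mpjpe` and the keypoint layout (21 hand joints followed by 8 object corners) are assumptions made for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, keypoints, 3) in the same unit
    (e.g. millimetres). Returns the mean Euclidean distance over all
    frames and keypoints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Per-keypoint Euclidean distance, then average over everything.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data: every predicted keypoint is offset by (3, 4, 0) mm,
# so each per-keypoint distance is exactly 5 mm.
N_HAND = 21  # assumed layout: hand joints first, object corners after
gt = np.zeros((2, 29, 3))
pred = gt + np.array([3.0, 4.0, 0.0])

print(mpjpe(pred, gt))                          # 5.0
print(mpjpe(pred[:, :N_HAND], gt[:, :N_HAND]))  # hand-only error
print(mpjpe(pred[:, N_HAND:], gt[:, N_HAND:]))  # object-only error
```

Slicing the keypoint axis, as in the last two lines, is one way to obtain separate hand and object figures like those quoted above.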
Abstract
We present HUP-3D, a 3D multi-view, multi-modal synthetic dataset for hand-ultrasound (US) probe pose estimation in the context of obstetric ultrasound. Egocentric markerless 3D joint pose estimation has potential applications in mixed-reality-based medical education. The ability to understand hand and probe movements programmatically opens the door to tailored guidance and mentoring applications. Our dataset consists of over 31k sets of RGB, depth, and segmentation mask frames, including pose-related ground-truth data, with a strong emphasis on image diversity and complexity. A camera-viewpoint sphere concept allows us to capture a variety of views, and a pre-trained network generates multiple hand grasp poses. Additionally, our approach includes a software-based image rendering concept that enhances diversity through various hand and arm textures, lighting conditions, and background images. Furthermore, we validated our proposed dataset with state-of-the-art learning models and obtained the lowest hand-object keypoint errors among the compared datasets. The dataset and other details are provided with the supplementary material. The source code of our grasp generation and rendering pipeline will be made publicly available.
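The abstract does not specify how camera viewpoints are placed on the sphere. As a rough sketch of the general idea only, the code below places cameras on a regular azimuth/elevation grid around a target point and returns unit directions that point each camera at that target. The grid scheme, the function name `sphere_viewpoints`, and all parameter values are assumptions, not the authors' method.

```python
import numpy as np

def sphere_viewpoints(center, radius, n_azimuth, n_elevation):
    """Camera positions on a sphere around `center`, all looking at it.

    Returns (positions, view_dirs), each of shape
    (n_azimuth * n_elevation, 3). view_dirs are unit vectors pointing
    from each camera towards the center.
    """
    center = np.asarray(center, dtype=float)
    azimuth = np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False)
    # Drop the two poles so no viewpoint is duplicated across azimuths.
    elevation = np.linspace(-np.pi / 2, np.pi / 2, n_elevation + 2)[1:-1]
    az, el = np.meshgrid(azimuth, elevation)
    # Unit vectors from the center to each camera (spherical coordinates).
    dirs = np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)],
        axis=-1,
    ).reshape(-1, 3)
    positions = center + radius * dirs
    return positions, -dirs

# Hypothetical values: a target 0.1 m above the origin, cameras 0.5 m away.
positions, view_dirs = sphere_viewpoints([0.0, 0.0, 0.1], 0.5, 8, 3)
print(positions.shape)  # (24, 3)
```

In a rendering pipeline, each position and view direction would be converted to a camera pose, and a subset of the sphere could be designated as egocentric views. How HUP-3D makes that split is not stated in this section.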
