HUP-3D: A 3D multi-view synthetic dataset for assisted-egocentric hand-ultrasound pose estimation

Manuel Birlo, Razvan Caramalau, Philip J. "Eddie" Edwards, Brian Dromey, Matthew J. Clarkson, Danail Stoyanov

TL;DR

HUP-3D addresses the lack of large-scale, labeled data for egocentric hand–probe pose estimation in obstetric ultrasound by introducing a scalable, synthetic multi-view dataset. The authors propose a two‑stage grasp generation and Blender‑based rendering pipeline, augmented by a novel sphere camera concept to capture both egocentric and non‑egocentric viewpoints, producing RGB, depth, and segmentation maps with ground-truth 3D poses. Evaluation with HOPE‑net on 31,680 frames across 11 grasps yields a state‑of‑the‑art MPJPE of 8.65 mm (hand 5.33 mm, object 17.05 mm), outperforming prior clinical datasets and a ResNet baseline. The work enables improved training for egocentric hand–probe pose estimation and has practical implications for mixed reality medical education and standardized ultrasound guidance, with future plans for real-image augmentation and temporal modeling.
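MPJPE (mean per-joint position error), the metric quoted above, is the average Euclidean distance between predicted and ground-truth 3D keypoints. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `mpjpe` and the list-of-tuples input format are assumptions made for illustration.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error between two keypoint lists.

    pred, gt: equally long sequences of (x, y, z) tuples in the same
    unit (e.g. mm). This input format is an illustrative assumption.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and equally long")
    # Euclidean distance per keypoint, averaged over all keypoints.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Toy example with two keypoints: errors of 5 and 0 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
print(mpjpe(pred, gt))
```

Hand and object errors can be reported separately by calling the function on the hand joints and on the probe corners alone.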

Abstract

We present HUP-3D, a 3D multi-view multi-modal synthetic dataset for hand-ultrasound (US) probe pose estimation in the context of obstetric ultrasound. Egocentric markerless 3D joint pose estimation has potential applications in mixed-reality-based medical education. The ability to understand hand and probe movements programmatically opens the door to tailored guidance and mentoring applications. Our dataset consists of over 31k sets of RGB, depth and segmentation mask frames, including pose-related ground truth data, with a strong emphasis on image diversity and complexity. A sphere-based camera viewpoint concept allows us to capture a variety of views, and multiple hand grasp poses are generated using a pre-trained network. Additionally, our approach includes a software-based image rendering concept, enhancing diversity with various hand and arm textures, lighting conditions, and background images. Furthermore, we validated our proposed dataset with state-of-the-art learning models and obtained the lowest hand-object keypoint errors. The dataset and other details are provided with the supplementary material. The source code of our grasp generation and rendering pipeline will be made publicly available.

Paper Structure (12 sections, 2 equations, 4 figures, 2 tables, 1 algorithm)

Figures (4)

  • Figure 1: Grasp Generation (blue) and Rendering Pipeline (red): The process begins with a MANO hand model initialization and a BPS-encoded Voluson model point cloud. CoarseNet generates initial hand poses, further refined by RefineNet for precise hand-probe alignment. In the rendering phase, the optimized hand pose, model vertices, and a SMPL-H model are processed in Blender. A multi-viewpoint camera, arranged on a sphere centered on the hand and arm, is combined with several textures and backgrounds to produce diverse RGB-D frames, segmentation maps, and annotations.
  • Figure 2: (a) Schematic grasp conversion from generative model to rendering software, including probe offset ($\Delta z$) correction. (b) Grasp rendering overview: (1) SMPL-H body model grasping the probe, showing egocentric and non-egocentric views. (2) Right arm and sphere-based camera orientations with remaining SMPL-H body parts hidden. (3) Camera angle sphere concept with views at various latitudes, centered on hand mesh; defines sphere ($r_{sphr}$) and circle ($r_{circ}$) radii. (4) Rendered hand-probe scene example from a sphere camera position.
  • Figure 3: Sample frames from the HUP-3D dataset, grouped columnwise, from left to right: RGB, depth, segmentation map, and ground truth annotations.
  • Figure 4: Qualitative results on 4 test images from HUP-3D. Columns, from left to right: RGB, predicted hand joints, predicted probe corners, predicted joints and corners, and ground-truth joints and corners.
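
The camera sphere described in Figure 2 can be illustrated with a small geometric sketch. It is not the authors' rendering code: it assumes that each circle of latitude has radius $r_{circ} = r_{sphr}\cos(\text{latitude})$, that cameras are spaced evenly along each circle, and that the Z axis points up (as in Blender). The function name `sphere_camera_positions` and its parameters are hypothetical.

```python
import math

def sphere_camera_positions(center, r_sphr, latitudes_deg, views_per_circle):
    """Return camera positions on a sphere around `center`.

    center: (x, y, z) of the sphere center, e.g. the hand mesh center.
    r_sphr: sphere radius.
    latitudes_deg: latitudes (degrees) of the circles to sample.
    views_per_circle: cameras per circle, evenly spaced (an assumption).
    """
    cx, cy, cz = center
    positions = []
    for lat_deg in latitudes_deg:
        lat = math.radians(lat_deg)
        r_circ = r_sphr * math.cos(lat)   # radius of this latitude circle
        height = r_sphr * math.sin(lat)   # offset along the up (Z) axis
        for k in range(views_per_circle):
            lon = 2.0 * math.pi * k / views_per_circle
            positions.append((cx + r_circ * math.cos(lon),
                              cy + r_circ * math.sin(lon),
                              cz + height))
    return positions

# Example: three latitude circles with eight views each.
cams = sphere_camera_positions((0.0, 0.0, 0.0), 0.5, [0.0, 30.0, 60.0], 8)
print(len(cams))
```

Every returned position lies at distance `r_sphr` from the center, so a camera placed there and aimed at the center keeps the hand and probe at a constant distance across views.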