Temporally Guided Articulated Hand Pose Tracking in Surgical Videos
Nathan Louis, Luowei Zhou, Steven J. Yule, Roger D. Dias, Milisa Manojlovich, Francis D. Pagani, Donald S. Likosky, Jason J. Corso
TL;DR
Articulated hand pose tracking in surgical videos is challenging due to occlusion and rapid motion. The authors introduce CondPose, a temporally guided pose estimator that conditions current hand predictions on prior-frame heatmaps via an attention-based fusion mechanism, and they release Surgical Hands, the first multi-instance hand pose tracking dataset in surgery. On both surgical data and PoseTrack18, CondPose achieves higher mAP for pose estimation and higher MOTA for tracking than frame-wise independent baselines, and ablations confirm that temporal priors help localization. The work offers a practical route to more accurate hand representations in surgery, with potential downstream benefits for skill assessment and action understanding.
Abstract
Articulated hand pose tracking is an under-explored problem with potential uses across many applications, especially in the medical domain. With a robust and accurate tracking system for surgical videos, the motion dynamics and movement patterns of the hands can be captured and analyzed for many rich tasks. In this work, we propose a novel hand pose estimation model, CondPose, which improves detection and tracking accuracy by incorporating a pose prior into its prediction. By following a temporally guided approach that effectively leverages past predictions, we show improvements over state-of-the-art methods that make frame-wise independent predictions. We collect Surgical Hands, the first dataset that provides multi-instance articulated hand pose annotations for videos. Our dataset provides over 8.1k annotated hand poses from publicly available surgical videos, together with bounding boxes, pose annotations, and tracking IDs to enable multi-instance tracking. When evaluated on Surgical Hands, our method outperforms the state-of-the-art approach on mean Average Precision (mAP), which measures pose estimation accuracy, and on Multiple Object Tracking Accuracy (MOTA), which assesses pose tracking performance. Compared with a frame-wise independent strategy, our method detects and tracks hand poses more accurately, with a more substantial impact on localization accuracy. This has positive implications for generating more accurate representations of hands in the scene for use in targeted downstream tasks.
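For readers unfamiliar with the tracking metric named above, the following is a minimal sketch of MOTA in its standard CLEAR MOT form, 1 - (FN + FP + IDSW) / GT, with counts summed over all frames. It is not the authors' evaluation code: the function name `mota` and the per-frame counts are hypothetical and invented purely for illustration, and pose tracking benchmarks typically apply the metric per keypoint and then average, which this sketch does not do.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Standard CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT.

    All four arguments are totals accumulated over every frame.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Hypothetical per-frame counts (FN, FP, IDSW, GT); not taken from the paper.
frames = [(1, 0, 0, 10), (2, 1, 1, 10), (0, 1, 0, 10)]

# Sum each column across frames before applying the formula.
fn, fp, idsw, gt = (sum(column) for column in zip(*frames))
score = mota(fn, fp, idsw, gt)
print(f"MOTA = {score:.3f}")  # prints "MOTA = 0.800"
```

Because every miss, false alarm, and identity switch counts against the same ground-truth total, MOTA can become negative when the number of errors exceeds the number of ground-truth instances.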
