ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection
Zijian Wu, Shuojue Yang, Yueming Jin, Septimiu E Salcudean
TL;DR
This work addresses the problem of locating surgical instrument tips directly in laparoscopic images, which enables robust registration of the ultrasound frame with the laparoscopic camera frame and avoids relying on the inaccurate tip positions reported by the da Vinci API. It introduces ToolTipNet, a segmentation-driven baseline that takes part-level instrument masks, obtained from segmentation foundation models such as Segment Anything, and uses a high-resolution HRNet backbone with mask-guided attention to predict tip heatmaps. Experiments on simulated and real datasets show ToolTipNet outperforming a hand-crafted SVD baseline in RMSE and accuracy, which illustrates the value of mask priors for precise tip localization, although real data remain more challenging. The approach offers a scalable baseline for tool tip localization that can support registration, skill assessment, and autonomous surgical tasks; future work aims to reduce the dependency on segmentation through multi-task learning and backbone integration.
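To make the heatmap output and the RMSE metric concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a tip is read out as the location of the heatmap maximum and that RMSE is the root of the mean squared Euclidean pixel error over all tips; the function names, the Gaussian target, and the `sigma` value are illustrative assumptions.

```python
import numpy as np


def gaussian_heatmap(height, width, tip_xy, sigma=2.0):
    """Render a 2D Gaussian centred on a tip location (x, y), in pixels.

    Assumption: a Gaussian target like this is a common choice for
    keypoint heatmaps; the source does not state the exact target used.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    x0, y0 = tip_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))


def decode_tip(heatmap):
    """Return the (x, y) pixel coordinates of the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col), float(row)


def rmse(pred_xy, true_xy):
    """Root of the mean squared Euclidean error; inputs are (N, 2) arrays."""
    pred = np.asarray(pred_xy, dtype=float)
    true = np.asarray(true_xy, dtype=float)
    return float(np.sqrt(np.mean(np.sum((pred - true) ** 2, axis=1))))


if __name__ == "__main__":
    heatmap = gaussian_heatmap(64, 80, tip_xy=(30, 12))
    print(decode_tip(heatmap))             # (30.0, 12.0)
    print(rmse([[0, 0], [3, 4]], [[0, 0], [0, 0]]))
```

Taking the maximum gives integer-pixel precision only; a sub-pixel refinement such as a local weighted average around the peak is a common extension, but the source does not say whether one is used.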
Abstract
In robot-assisted laparoscopic radical prostatectomy (RALP), the location of the instrument tip is important for registering the ultrasound frame with the laparoscopic camera frame. A long-standing limitation is that the instrument tip position obtained from the da Vinci API is inaccurate and requires hand-eye calibration. Directly computing the position of the tool tip in the camera frame with a vision-based method is therefore an attractive solution. In addition, surgical instrument tip detection is a key component of other tasks, such as surgical skill assessment and surgery automation. However, the task is challenging because the tool tip is small and the surgical instrument is articulated. Surgical instrument segmentation has become relatively easy with the emergence of segmentation foundation models such as Segment Anything. Building on this advance, we explore a deep learning-based surgical instrument tip detection approach that takes the part-level instrument segmentation mask as input. Comparison experiments with a hand-crafted image-processing approach demonstrate the superiority of the proposed method on simulated and real datasets.
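For intuition about what a hand-crafted, mask-based tip estimate can look like, here is a hedged sketch of one plausible SVD-based variant. The details of the baseline used in the comparison are not given in this section, so everything below is an assumption: the principal axis of a part mask is found with an SVD of the centred pixel coordinates, and the tip is taken as the mask pixel that lies farthest along that axis, pointing away from a reference point. The function name and the `reference_xy` argument (for example, the centroid of the shaft or wrist mask) are hypothetical.

```python
import numpy as np


def svd_tip_estimate(mask, reference_xy):
    """Estimate a tip from a binary part mask (illustrative assumption).

    mask:         2D array, non-zero where the instrument part is present.
    reference_xy: hypothetical (x, y) point on the proximal side of the
                  part, used only to decide which end of the axis is the tip.
    Returns (x, y) of the estimated tip, or None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None

    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)

    # The first right-singular vector is the direction of largest spread.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    axis = vt[0]

    # Orient the axis so that it points away from the reference point.
    if np.dot(centroid - np.asarray(reference_xy, dtype=float), axis) < 0:
        axis = -axis

    # The tip is the mask pixel with the largest projection on the axis.
    projections = (pts - centroid) @ axis
    tip = pts[np.argmax(projections)]
    return float(tip[0]), float(tip[1])


if __name__ == "__main__":
    mask = np.zeros((20, 40), dtype=np.uint8)
    mask[9:12, 5:30] = 1  # horizontal bar, columns 5..29
    print(svd_tip_estimate(mask, reference_xy=(0, 10)))
```

A rule of this kind depends on the part being roughly elongated and on the reference point lying on the correct side, which makes it sensitive to articulation and occlusion; this is one reason to compare it against a learned detector.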
