SurgPose: a Dataset for Articulated Robotic Surgical Tool Pose Estimation and Tracking

Zijian Wu, Adam Schmidt, Randy Moore, Haoying Zhou, Alexandre Banks, Peter Kazanzides, Septimiu E. Salcudean

TL;DR

SurgPose addresses the critical need for real-world data to train and evaluate articulated surgical tool pose estimation by introducing a stereo, multi-instrument dataset with instance-aware keypoints and skeletons labeled using UV-reactive markers. The labeling pipeline leverages UV fluorescence and SAMv2-based semi-supervised annotation, enabling efficient, scalable keypoint production while preserving image fidelity. The dataset supports 2D pose estimation and 3D pose lifting via stereo depth, and includes diverse instrument types, ex vivo backgrounds, and 1001-step trajectories, accompanied by baseline evaluations of modern pose-estimation methods. While promising, the authors acknowledge limitations in distribution and transferability to clinical settings, pointing to future work in expanding instrument diversity, in vivo data, and multi-modal integration, with potential benefits for few-shot learning and foundation-model adaptation in surgical perception.

Abstract

Accurate and efficient surgical robotic tool pose estimation is of fundamental significance to downstream applications such as augmented reality (AR) in surgical training and learning-based autonomous manipulation. While significant advancements have been made in pose estimation for humans and animals, it is still a challenge in surgical robotics due to the scarcity of published data. The relatively large absolute error of the da Vinci end effector kinematics and arduous calibration procedure make calibrated kinematics data collection expensive. Driven by this limitation, we collected a dataset, dubbed SurgPose, providing instance-aware semantic keypoints and skeletons for visual surgical tool pose estimation and tracking. By marking keypoints using ultraviolet (UV) reactive paint, which is invisible under white light and fluorescent under UV light, we execute the same trajectory under different lighting conditions to collect raw videos and keypoint annotations, respectively. The SurgPose dataset consists of approximately 120k surgical instrument instances (80k for training and 40k for validation) of 6 categories. Each instrument instance is labeled with 7 semantic keypoints. Since the videos are collected in stereo pairs, the 2D pose can be lifted to 3D based on stereo-matching depth. In addition to releasing the dataset, we test a few baseline approaches to surgical instrument tracking to demonstrate the utility of SurgPose. More details can be found at surgpose.github.io.
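
The stereo lifting mentioned in the abstract can be illustrated with a minimal sketch. It assumes a rectified stereo pair, a pinhole camera model, and a disparity map that some stereo-matching method has already produced for the left image. The function name `lift_keypoints_to_3d` and all of its parameters are hypothetical and are not part of the SurgPose release; the abstract does not specify the authors' lifting procedure beyond "stereo-matching depth".

```python
import numpy as np


def lift_keypoints_to_3d(kps_left, disparity, fx, fy, cx, cy, baseline):
    """Back-project 2D keypoints from the rectified left image to 3D.

    kps_left  : (N, 2) array of (u, v) pixel coordinates.
    disparity : (H, W) disparity map in pixels for the left image.
    fx, fy    : focal lengths in pixels (assumed known from calibration).
    cx, cy    : principal point in pixels.
    baseline  : distance between the two cameras, in metres.

    Returns an (N, 3) array of (X, Y, Z) in the left camera frame.
    Keypoints with non-positive disparity get NaN coordinates.
    """
    kps_left = np.asarray(kps_left, dtype=float)
    u, v = kps_left[:, 0], kps_left[:, 1]

    # Sample the disparity at the nearest pixel, clamped to the image bounds.
    rows = np.clip(np.rint(v).astype(int), 0, disparity.shape[0] - 1)
    cols = np.clip(np.rint(u).astype(int), 0, disparity.shape[1] - 1)
    d = disparity[rows, cols].astype(float)

    # Depth from disparity for a rectified pair: Z = fx * baseline / d.
    z = np.full_like(d, np.nan)
    valid = d > 0
    z[valid] = fx * baseline / d[valid]

    # Pinhole back-projection of each pixel at its depth.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Nearest-pixel sampling keeps the sketch short. In practice a median over a small window around each keypoint is a common way to reduce the effect of disparity noise on thin instrument parts.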

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Keypoint definition for the large needle driver. The instrument pose is represented by a skeleton that captures its joints and links. As with the large needle driver, every instrument in SurgPose has a skeleton of 5 keypoints (1-5). Keypoints 6 and 7 are redundant but may be helpful for tracking.
  • Figure 2: (a) The miniature paint brushes used for marking keypoints. (b) The UV-reactive paint under white light; the left and right paints are red and green, respectively. (c) The UV-reactive paint under black light.
  • Figure 3: Experimental setup. PSM 1 and 3 are controlled by the dVRK. The ECM is always static during the data collection. When the surgical lamp is turned on, black lights are turned off, and vice versa.
  • Figure 4: The GLCM features of the image patches around the keypoints.
  • Figure 5: Comparison of tool appearance before and after marking with UV-reactive paint. From top to bottom, the rows show frames before marking, frames after marking, the difference between them, and frames under UV light.
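
For readers unfamiliar with the GLCM (gray-level co-occurrence matrix) features named in Figure 4, the following is a minimal NumPy sketch of how such texture features can be computed for one grayscale patch. The quantization level, the pixel offset, and the three features shown (contrast, homogeneity, energy) are illustrative assumptions; the caption does not say which GLCM settings or features the authors used, and the function names are hypothetical.

```python
import numpy as np


def glcm(patch, levels=8, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix for one pixel offset.

    patch  : 2D array of 8-bit intensities (0-255).
    levels : number of gray levels after quantization (assumed value).
    dx, dy : non-negative offset to the neighbouring pixel.
    The patch must be larger than the offset in each dimension.
    """
    patch = np.asarray(patch)
    # Quantize 0-255 intensities into `levels` bins.
    q = (patch.astype(np.int64) * levels) // 256
    h, w = q.shape

    # Pair every pixel (r, c) with its neighbour (r + dy, c + dx).
    a = q[0:h - dy, 0:w - dx]
    b = q[dy:h, dx:w]

    # Count how often each pair of gray levels occurs together.
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (a.ravel(), b.ravel()), 1.0)

    # Symmetrize so that (i, j) and (j, i) are treated alike, then normalize.
    m = m + m.T
    return m / m.sum()


def glcm_features(p):
    """Contrast, homogeneity and energy of a normalized GLCM."""
    i, j = np.indices(p.shape)
    diff2 = (i - j) ** 2
    return {
        "contrast": float(np.sum(p * diff2)),
        "homogeneity": float(np.sum(p / (1.0 + diff2))),
        "energy": float(np.sum(p ** 2)),
    }
```

Comparing such features on patches around a keypoint before and after marking is one way to quantify whether the paint changes local texture, which is the kind of comparison Figures 4 and 5 suggest.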