Efficient Surgical Robotic Instrument Pose Reconstruction in Real World Conditions Using Unified Feature Detection
Zekai Liang, Kazuya Miyata, Xiao Liang, Florian Richter, Michael C. Yip
TL;DR
This work addresses camera-to-end-effector calibration for minimally invasive surgical (MIS) robots under real-world conditions, where long serial chains and partial visibility hinder traditional methods. A single network unifies shaft-edge detection and keypoint localization and is trained on large-scale synthetic data with ground-truth cylinder edges expressed as image lines $A u + B v + C = 0$. A fast geometry-based pose solver then converts the detected features into a 6D pose without iterative refinement. Key contributions include the unified feature-detection architecture (Edge Net and Keypoint Net), a cylinder-based shaft-centerline reconstruction, and a TRF-based optimization for shaft roll that yields $\mathbf{T}_{\mathrm{cam}\rightarrow\mathrm{ee}}$ efficiently. The approach reports state-of-the-art accuracy and millisecond-level inference, which makes it suitable for online control in challenging surgical environments.
Abstract
Accurate camera-to-robot calibration is essential for any vision-based robotic control system and is especially critical for minimally invasive surgical (MIS) robots, whose instruments perform precise micro-manipulations. However, MIS robots have long kinematic chains, and only some of their degrees of freedom are visible to the camera. Both properties challenge conventional camera-to-robot calibration methods, which assume stiff robots with good visibility. Previous works have investigated keypoint-based and rendering-based approaches to this problem in real-world conditions, but they often struggle with consistent feature detection or have long inference times, neither of which is acceptable for online robot control. In this work, we propose a novel framework that unifies the detection of geometric primitives (keypoints and shaft edges) through a shared encoding, enabling efficient pose estimation via projection geometry. The architecture detects both keypoints and edges in a single inference pass and is trained on large-scale synthetic data with projective labeling. We evaluate the method on both feature detection and pose estimation, and qualitative and quantitative results demonstrate fast performance and state-of-the-art accuracy in challenging surgical environments.
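As a minimal illustration of the image-line form $A u + B v + C = 0$ used for the shaft-edge labels, the sketch below computes the coefficients of the line through two pixel locations and the distance from a pixel to that line. The function names, the scaling to $A^2 + B^2 = 1$, and the sample coordinates are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def line_through_points(p, q):
    """Return (A, B, C) such that A*u + B*v + C = 0 passes through pixels p and q.

    The result is scaled so that A^2 + B^2 = 1 (an assumed convention),
    which makes |A*u + B*v + C| a distance in pixels.
    """
    p_h = np.array([p[0], p[1], 1.0])  # homogeneous image point
    q_h = np.array([q[0], q[1], 1.0])
    line = np.cross(p_h, q_h)          # line joining two homogeneous points
    norm = np.hypot(line[0], line[1])
    if norm == 0.0:
        raise ValueError("p and q must be distinct points")
    return line / norm


def point_line_distance(line, point):
    """Perpendicular pixel distance from point (u, v) to a normalized line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c)


# Hypothetical pixel coordinates of two points on one projected shaft edge.
edge = line_through_points((100.0, 50.0), (300.0, 250.0))
residual = point_line_distance(edge, (200.0, 150.0))  # a point on the same edge
```

With this normalization, the residual of a detected edge pixel against a labeled line can be read directly in pixels, which is one simple way to compare predicted and ground-truth edges.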
