Efficient Surgical Robotic Instrument Pose Reconstruction in Real World Conditions Using Unified Feature Detection

Zekai Liang, Kazuya Miyata, Xiao Liang, Florian Richter, Michael C. Yip

TL;DR

This work addresses accurate camera-to-end-effector calibration for minimally invasive surgical (MIS) robots under real-world conditions, where long serial chains and partial visibility hinder traditional methods. It unifies shaft-edge detection and keypoint localization in a single network trained on large-scale synthetic data with ground-truth cylinder edges defined by $A u + B v + C = 0$, and uses a fast geometry-based pose solver to convert the detected features into a 6D pose without iterative refinement. Key contributions include the unified feature-detection architecture (Edge Net and Keypoint Net), a cylinder-based shaft-centerline reconstruction, and a TRF-based optimization for shaft roll that produces $\mathbf{T}_{\mathrm{cam}\rightarrow\mathrm{ee}}$ efficiently. The approach yields state-of-the-art accuracy and millisecond-level inference, making it suitable for online control in challenging surgical environments.

Abstract

Accurate camera-to-robot calibration is essential for any vision-based robotic control system and especially critical in minimally invasive surgical (MIS) robots, where instruments conduct precise micro-manipulations. However, MIS robots have long kinematic chains and partial visibility of their degrees of freedom in the camera, which introduces challenges for conventional camera-to-robot calibration methods that assume stiff robots with good visibility. Previous works have investigated both keypoint-based and rendering-based approaches to address this challenge in real-world conditions; however, they often struggle with consistent feature detection or have long inference times, neither of which is ideal for online robot control. In this work, we propose a novel framework that unifies the detection of geometric primitives (keypoints and shaft edges) through a shared encoding, enabling efficient pose estimation via projection geometry. This architecture detects both keypoints and edges in a single inference and is trained on large-scale synthetic data with projective labeling. The method is evaluated on both feature detection and pose estimation, with qualitative and quantitative results demonstrating fast performance and state-of-the-art accuracy in challenging surgical environments.

Paper Structure

This paper contains 16 sections, 27 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Pose reconstruction comparison between our framework and a differentiable-rendering-based method. The skeleton overlay is obtained from the estimated pose and forward kinematics.
  • Figure 2: The overview of the proposed framework. Keypoint Net and Edge Net are jointly trained on large-scale synthetic data using heatmap regression with a shared encoder. During inference, the detected keypoints and shaft edges are passed to a geometric pose solver, which leverages the robot’s projective constraints to efficiently estimate the full 6D pose.
  • Figure 3: Synthetic training data generated with ground truth shaft edges and keypoint annotations (Outer Roll, Wrist Yaw and Tool Tips).
  • Figure 4: We apply a pixel-level edge refinement to the output of Edge Net using the Line Segment Detector to achieve a more accurate shaft estimation.
  • Figure 5: Qualitative comparison of feature detection results between our model and prior models. Prior models follow the same implementation as in the original papers.
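
The TL;DR describes shaft edges as image lines of the form $A u + B v + C = 0$. As a minimal sketch of that parameterization only (not the authors' implementation), the snippet below converts a line segment given by two pixel endpoints into normalized coefficients and measures a pixel's distance to the line. The function names `line_from_segment` and `point_line_distance` and the normalization $A^2 + B^2 = 1$ are assumptions made for illustration.

```python
import math


def line_from_segment(p0, p1):
    """Return (A, B, C) such that A*u + B*v + C = 0 passes through p0 and p1.

    Hypothetical helper: the coefficients are scaled so that A^2 + B^2 = 1,
    which is an assumed convention, not one stated in the source.
    """
    (u0, v0), (u1, v1) = p0, p1
    a = v0 - v1
    b = u1 - u0
    c = u0 * v1 - u1 * v0
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("endpoints must be distinct")
    return a / norm, b / norm, c / norm


def point_line_distance(line, p):
    """Perpendicular pixel distance from p = (u, v) to a normalized line."""
    a, b, c = line
    u, v = p
    return abs(a * u + b * v + c)


if __name__ == "__main__":
    # A horizontal segment along v = 0; the pixel (5, 3) lies 3 px away.
    edge = line_from_segment((0.0, 0.0), (10.0, 0.0))
    print(edge, point_line_distance(edge, (5.0, 3.0)))
```

With normalized coefficients, the residual $|A u + B v + C|$ is directly a distance in pixels, which makes it convenient for comparing a detected edge against a labeled one.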