Real-time Rendering-based Surgical Instrument Tracking via Evolutionary Optimization

Hanyang Hu, Zekai Liang, Florian Richter, Michael C. Yip

Abstract

Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots remains challenging due to the partial visibility and specialized articulation design of surgical instruments. Previous approaches are often prone to unreliable feature detections under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational cost and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. By using batch rendering to evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint-angle-free and bi-manual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.

Paper Structure (13 sections, 15 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Skeleton overlays of the top-$5$ CMA-ES samples across successive iterations. At each iteration, CMA-ES draws a population of candidate poses from a Gaussian distribution, evaluates their fitness using render-and-match objectives, and updates the distribution toward better solutions. Within 3 iterations, the sampled poses concentrate around the correct alignment.
  • Figure 2: Overview of the proposed framework. Given RGB video frames, segmentation masks and tool-tip detections are produced to define a render-and-match objective optimized via CMA-ES. At each iteration, pose candidates are sampled from the current distribution, evaluated in parallel through batched forward kinematics and rendering, and ranked by the objective to update the sampling distribution. The optimized estimates are temporally filtered and propagated to initialize the next frame. In this framework, three joint angles are optimized: wrist pitch $q_1$, wrist yaw $q_2$, and jaw angle $q_3$. The end-effector pose is defined at the wrist pitch frame. A look-at camera representation is adopted to decouple the shaft rotation $\beta$ from the remaining rotational components.
  • Figure 3: Qualitative comparison of bi-manual tracking on the SurgPose dataset. Reference masks are shown with green overlays. The proposed method achieves accurate pose reconstruction when joint angle readings are available and remains robust to poor initialization even without joint measurements, while the gradient-based approach is prone to local minima and error accumulation.
  • Figure 4: Qualitative comparison of single-arm tracking on the collected dataset. Detected colored markers for the particle filter are shown as green dots (centroids of the regions outlined by red contours). Reference masks for the proposed approach are shown with green overlays. Our method demonstrates improved accuracy in the alignment of tool tips compared to the particle filter.
  • Figure 5: Qualitative comparison of pose reconstruction results using different optimization strategies on a synthetic trajectory.
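
The sample, evaluate, rank, and update loop described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation and it is not full CMA-ES: it keeps only weighted recombination of the mean and a rank-mu style covariance update, and omits evolution paths and step-size adaptation. The names `batched_loss` and `simple_es`, the 6-dimensional parameter vector, the quadratic stand-in for the render-and-match objective, and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def batched_loss(candidates, target):
    """Toy stand-in for a render-and-match objective (assumption).

    Takes a (population, dim) array and returns one loss per candidate,
    mimicking how a batch renderer would score all candidates at once.
    """
    return np.sum((candidates - target) ** 2, axis=1)


def simple_es(loss_fn, mean, sigma, pop_size=32, n_elite=8, iters=40, seed=0):
    """Simplified Gaussian evolution strategy (not full CMA-ES)."""
    rng = np.random.default_rng(seed)
    dim = mean.shape[0]
    cov = (sigma ** 2) * np.eye(dim)

    # Log-decreasing recombination weights: better-ranked samples count more.
    weights = np.log(n_elite + 0.5) - np.log(np.arange(1, n_elite + 1))
    weights /= weights.sum()

    for _ in range(iters):
        # Sample the whole population from N(mean, cov) via a Cholesky factor.
        chol = np.linalg.cholesky(cov)
        samples = mean + rng.standard_normal((pop_size, dim)) @ chol.T

        # Evaluate every candidate in one batched call, then rank.
        losses = loss_fn(samples)
        elite = samples[np.argsort(losses)[:n_elite]]

        # Steps are measured from the old mean, as in the rank-mu update.
        steps = elite - mean
        rank_mu = (steps * weights[:, None]).T @ steps

        # Move the distribution toward the better solutions.
        mean = weights @ elite
        cov = 0.5 * cov + 0.5 * rank_mu + 1e-10 * np.eye(dim)

    return mean, cov


# Hypothetical 6-parameter "pose" to recover, starting from a zero guess.
target = np.array([0.3, -0.2, 0.5, 0.1, -0.4, 0.2])
estimate, final_cov = simple_es(
    lambda x: batched_loss(x, target), mean=np.zeros(6), sigma=0.5
)
print("error norm:", np.linalg.norm(estimate - target))
```

In a tracking setting, the returned mean would plausibly serve as the initialization for the next frame, in the spirit of the propagation step in Figure 2. The sketch only shows why a single batched evaluation per iteration fits this family of optimizers well.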