Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

Jonas Hein, Lilian Calvet, Matthias Seibold, Siyu Tang, Marc Pollefeys, Philipp Fürnstahl

Abstract

Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which requires only a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes the reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. It achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while generalizing fully to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking of surgical instruments, and highlight the unique challenges of surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundation models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
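
To make the detection stage concrete, the following is a minimal sketch of the template-similarity scoring described above, assuming a pre-trained feature extractor that returns one global descriptor per image crop (e.g. a DINOv2-style backbone). The function names, the use of cosine similarity, and the max-over-templates aggregation are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch: score masked proposal crops against rendered CAD templates.
    # "extractor" is any pre-trained network mapping (N, 3, H, W) crops to (N, D)
    # descriptors; all names here are illustrative, not the paper's code.
    import torch
    import torch.nn.functional as F

    def embed(extractor, crops):
        with torch.no_grad():
            feats = extractor(crops)       # (N, D) global descriptors
        return F.normalize(feats, dim=-1)  # unit norm -> dot product = cosine

    def score_proposals(extractor, proposal_crops, template_crops):
        """Score each proposal by its best cosine similarity to any rendered
        template of the instrument's CAD model; higher = better match."""
        p = embed(extractor, proposal_crops)  # (P, D)
        t = embed(extractor, template_crops)  # (T, D)
        sim = p @ t.T                         # (P, T) cosine similarities
        return sim.max(dim=1).values          # best template per proposal

Per the abstract, proposals scored this way are then matched across views, triangulated into 3D instance candidates, and filtered for multi-view geometric consistency before pose estimation.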

Paper Structure

This paper contains 8 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our proposed pipeline. We first detect surgical instruments from multiple views using the detection stage outlined in Sec. \ref{sec:detection}. The 6D pose of all detected instances is then estimated using the pose estimation stage presented in Sec. \ref{sec:pose_estimation}.
  • Figure 2: Comparison of pose estimate error distributions from the supervised and training-free approaches. To enable a fair comparison with the supervised baseline, we use ground-truth detections in the form of modal object masks as input to our pose estimation stage. All comparisons use 5 input views of the MVPSP wetlab test set (left) and the OR-X test set (right).
  • Figure 3: Comparison of pose estimate error distributions depending on the instrument detections. We compare our detection stage against SAM2 with a scoring oracle and against ground-truth detections, using 5 input views on the MVPSP OR-X test set. Dashed lines indicate results of MVFP without refinement; solid lines of the same color show results after contour-based refinement.
  • Figure 4: Qualitative results from the detection and pose estimation stages. The left image shows an exemplary instrument detection of our 5-view model. The center image shows the extraction of unoccluded contour points: the outline of the pose hypothesis is rendered in orange and the extracted contour points are highlighted in blue (a sketch of the corresponding refinement objective follows this list). Note the absence of contour points along the hand and the scrubs at the bottom of the image. The right image displays an exemplary final pose estimate of our pipeline using five views and ground-truth detections.
  • Figure 5: Representative failure cases from the masklet generation (left), masklet classification (middle), and coarse pose estimation (right) steps. We superimpose the detections with their predicted masklets (blue) and bounding boxes (black). For the pose estimation, we superimpose extracted contour points in blue and the estimated instrument pose as an orange outline. The coarse pose estimate is flipped along the instrument's axis and is unrecoverable by the refinement stage.
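
As referenced in the Figure 4 caption, the final occlusion-aware contour registration can be read as a multi-view reprojection-error minimization of roughly the following form. This is a sketch of the stated objective, not the paper's exact formulation; the robust loss \rho and the nearest-contour-point correspondence \mathbf{c}_{v,i} are assumptions.

    \min_{\mathbf{R},\,\mathbf{t}} \; \sum_{v=1}^{V} \sum_{i \in \mathcal{U}_v}
        \rho\!\left( \left\| \pi_v\!\left( \mathbf{R}\,\mathbf{X}_i + \mathbf{t} \right) - \mathbf{c}_{v,i} \right\|^2 \right)

Here \pi_v projects into view v, \mathbf{X}_i are contour points on the CAD model, \mathcal{U}_v indexes the contour points unoccluded in view v (as extracted in Figure 4, center), and \mathbf{c}_{v,i} is the matched observed contour point, so that only unoccluded contours contribute to the reprojection error.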