Weakly Semi-supervised Tool Detection in Minimally Invasive Surgery Videos

Ryo Fujii, Ryo Hachiuma, Hideo Saito

TL;DR

The paper tackles the high annotation cost of surgical tool detection by proposing a weakly semi-supervised framework that leverages a small fully labeled set and a large image-level labeled set. It introduces a refinement network trained with multiple instance learning to adjust pseudo-labels produced by a teacher model, and augments this with a co-occurrence loss that encodes tool-pair co-occurrence priors observed in MIS videos. Empirical results on Endovis2018 show that the approach substantially improves mean average precision over baselines with similar annotation budgets, approaching fully supervised performance, and the co-occurrence term yields additional gains. The method is practical, integrates with standard detectors like Faster R-CNN with FPN, and offers a scalable path toward accurate tool detection with reduced labeling effort.
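
The multiple instance learning idea mentioned above can be illustrated with a minimal sketch. It assumes each image is a "bag" of region proposals with per-class probabilities, that image-level scores are obtained by max pooling over proposals, and that training uses a binary cross-entropy against the image-level labels. Max pooling is only one common aggregation choice and is an assumption here, as are the function names `mil_image_scores` and `image_level_bce`; the summary does not specify the aggregation used in the paper.

```python
import math


def mil_image_scores(proposal_scores):
    """Aggregate per-proposal class probabilities into image-level scores.

    proposal_scores: list of proposals, each a list of per-class probabilities.
    Uses max pooling over proposals (an assumed aggregation, not necessarily
    the paper's): an image contains a class if at least one proposal does.
    """
    num_classes = len(proposal_scores[0])
    return [max(p[c] for p in proposal_scores) for c in range(num_classes)]


def image_level_bce(image_scores, image_labels, eps=1e-7):
    """Mean binary cross-entropy between image-level scores and 0/1 labels."""
    total = 0.0
    for score, label in zip(image_scores, image_labels):
        score = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return total / len(image_labels)


# Two proposals, two hypothetical tool classes.
scores = mil_image_scores([[0.9, 0.1], [0.2, 0.3]])
loss_when_labels_match = image_level_bce(scores, [1, 0])
loss_when_labels_differ = image_level_bce(scores, [0, 1])
```

With this setup, only image-level labels are needed to produce a training signal for the per-proposal classifier, which is what makes the large weakly labeled set usable.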

Abstract

Surgical tool detection is essential for analyzing and evaluating minimally invasive surgery videos. Current approaches are mostly based on supervised methods that require large, fully instance-level labels (i.e., bounding boxes). However, large image datasets with instance-level labels are often limited because of the burden of annotation. Thus, surgical tool detection is important when providing image-level labels instead of instance-level labels since image-level annotations are considerably more time-efficient than instance-level annotations. In this work, we propose to strike a balance between the extremely costly annotation burden and detection performance. We further propose a co-occurrence loss, which considers a characteristic that some tool pairs often co-occur together in an image to leverage image-level labels. Encapsulating the knowledge of co-occurrence using the co-occurrence loss helps to overcome the difficulty in classification that originates from the fact that some tools have similar shapes and textures. Extensive experiments conducted on the Endovis2018 dataset in various data settings show the effectiveness of our method.
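
To make the notion of tool-pair co-occurrence concrete, the following sketch estimates how often each pair of tools appears together, using image-level labels only. It is an illustration of the prior that a co-occurrence loss could draw on, not the loss defined in the paper; the function name `cooccurrence_frequencies` and the placeholder tool names are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequencies(image_labels):
    """Return the fraction of images in which each unordered tool pair co-occurs.

    image_labels: list of per-image collections of tool names (image-level labels).
    """
    num_images = len(image_labels)
    if num_images == 0:
        return {}
    pair_counts = Counter()
    for tools in image_labels:
        # Deduplicate and sort so each pair has a single canonical key.
        for pair in combinations(sorted(set(tools)), 2):
            pair_counts[pair] += 1
    return {pair: count / num_images for pair, count in pair_counts.items()}


# Hypothetical image-level labels for four frames.
labels = [["tool_a", "tool_b"], ["tool_a", "tool_b", "tool_c"], ["tool_c"], ["tool_a"]]
freqs = cooccurrence_frequencies(labels)
```

Pairs with a high frequency are candidates for the kind of prior the abstract describes: if one tool of a frequent pair is detected confidently, the presence of its usual partner becomes more plausible than that of a visually similar but rarely co-occurring tool.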

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Overview of the proposed framework. The white arrows represent the training stage, and the black arrows represent the pseudo-label generation stage.
  • Figure 2: Details of the training and refinement procedure of the refinement model, corresponding to steps 3 and 4 of our framework.
  • Figure 3: Comparison of the mAPs of the student model (i.e., Faster R-CNN) under different types of supervision on Endovis2018. 'Supervised' and 'Semi-Supervised' refer to student models trained on labeled data only, and on labeled data plus pseudo-labels obtained from a teacher model without refinement, respectively.
  • Figure 4: Qualitative results of the object detection with different supervision methods. The colors of the bounding boxes denote the estimated category of the tools.