Methods and datasets for segmentation of minimally invasive surgical instruments in endoscopic images and videos: A review of the state of the art

Tobias Rueckert, Daniel Rueckert, Christoph Palm

TL;DR

This review surveys state-of-the-art, marker-free segmentation of surgical instruments in endoscopic data, comparing semantic and instance-level methods and their use of temporal information. It catalogs publicly available datasets (notably the EndoVis series and related public resources) and analyzes how these datasets influence method development, including domain adaptation and synthetic data approaches. The authors identify a strong reliance on supervised learning, limited cross-domain generalization, and a growing role for attention and temporal models, while highlighting reproducibility and data-access challenges. The work emphasizes real-time applicability and proposes future directions such as synthetic data, VR-based datasets, and robust domain-adaptive frameworks to accelerate clinically deployable instrument segmentation for robot-assisted minimally invasive surgery (RAMIS).

Abstract

In the field of computer- and robot-assisted minimally invasive surgery, enormous progress has been made in recent years based on the recognition of surgical instruments in endoscopic images and videos. In particular, the determination of the position and type of instruments is of great interest. Current work involves both spatial and temporal information, with the idea that predicting the movement of surgical tools over time may improve the quality of the final segmentations. The provision of publicly available datasets has recently encouraged the development of new methods, mainly based on deep learning. In this review, we identify and characterize datasets used for method development and evaluation and quantify their frequency of use in the literature. We further present an overview of the current state of research regarding the segmentation and tracking of minimally invasive surgical instruments in endoscopic images and videos. The paper focuses on methods that work purely visually, without markers of any kind attached to the instruments, considering both single-frame semantic and instance segmentation approaches, as well as those that incorporate temporal information. The publications analyzed were identified through the platforms Google Scholar, Web of Science, and PubMed. The search terms used were "instrument segmentation", "instrument tracking", "surgical tool segmentation", and "surgical tool tracking", resulting in a total of 741 articles published between 01/2015 and 07/2023, of which 123 were included using systematic selection criteria. A discussion of the reviewed literature is provided, highlighting existing shortcomings and emphasizing the available potential for future developments.

Paper Structure (51 sections, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Number of relevant publications by year resulting from searches on the platforms Web of Science, PubMed, and Google Scholar, according to the selection criteria. For 2023, in addition to the publications up to and including July, the numbers of expected papers by the end of the year are shown in italics at the top of the transparent boxes.
  • Figure 2: Visualization of commonly used datasets along with their frequency of use in recent publications.
  • Figure 3: Example images from frequently used endoscopic datasets, according to the upper part of Table \ref{tab:datasets_usages_tab}. Images are taken from the EndoVis-2015 \cite{datasets:example_images_endovis15_rigid}, EndoVis-2017 \cite{datasets:example_images_endovis17}, EndoVis-2018 \cite{datasets:example_images_endovis18}, EndoVis-2019 \cite{datasets:example_images_endovis19}, Lap. I2I Translation \cite{datasets:example_images_lap_i2i}, Sinus-Surgery-C/L \cite{datasets:example_images_sinus_surgery}, UCL dVRK \cite{datasets:example_images_ucl_dvrk}, Kvasir Instruments \cite{datasets:example_images_kvasir}, and RoboTool \cite{datasets:example_images_robotool_robotic} datasets.
  • Figure 4: Structure of Section \ref{semantic_segmentation_methods} on semantic segmentation methods, to be read from left to right. The figure shows the topics by which the identified publications are grouped. Both single-frame segmentation methods (Section \ref{semantic_segmentation:single_frame_seg}) and methods involving temporal information (Section \ref{semantic_segmentation:incorporating_temporal_information}) are divided by segmentation type (Sections \ref{semantic_segmentation:single_frame_seg:seg_type} and \ref{semantic_segmentation:temporal_info:segmentation_type}), learning strategy (Sections \ref{semantic_segmentation:single_frame_seg:learning_strategy} and \ref{semantic_segmentation:temporal_info:learning_strategy}), inference speed (Sections \ref{semantic_segmentation:single_frame_seg:inference_speed} and \ref{semantic_segmentation:temporal_info:inference_speed}), and attention mechanisms (Sections \ref{semantic_segmentation:single_frame_seg:attention_mechanisms} and \ref{semantic_segmentation:temporal_info:attention_mechanisms}). For single-frame segmentation, domain adaptation methods (Section \ref{semantic_segmentation:single_frame_seg:domain_adaptation}) are additionally presented, and for publications with temporal information, tracking approaches (Section \ref{semantic_segmentation:temporal_info:domain_adaptation}) are explained.
  • Figure 5: Number of relevant publications per segmentation type for semantic single-image segmentation.
  • ...and 5 more figures