Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View

Mingyu Sheng, Jianan Fan, Dongnan Liu, Ron Kikinis, Weidong Cai

TL;DR

This work proposes an unsupervised method that reframes video frame segmentation as a graph partitioning problem, with image pixels regarded as graph nodes, which differs significantly from previous efforts.

Abstract

Surgical instrument segmentation (SIS) on endoscopic images is a long-standing and essential task in computer-assisted interventions for minimally invasive surgery. With the recent surge of data-hungry deep learning methods, training a neural predictive model on massive expert-curated annotations has become the dominant, off-the-shelf approach in the field. This, however, can impose a prohibitive burden on clinicians, who must prepare fine-grained pixel-wise labels for the collected surgical video frames. In this work, we propose an unsupervised method that reframes video frame segmentation as a graph partitioning problem and regards image pixels as graph nodes, which differs significantly from previous efforts. A self-supervised pre-trained model is first leveraged as a feature extractor to capture high-level semantic features. Then, Laplacian matrices are computed from the features and eigendecomposed for graph partitioning. On the "deep" eigenvectors, a surgical video frame is meaningfully segmented into different modules such as tools and tissues, providing distinguishable semantic information such as locations, classes, and relations. The segmentation problem can then be naturally tackled by applying clustering or thresholding to the eigenvectors. Extensive experiments are conducted on various datasets (e.g., EndoVis2017, EndoVis2018, and UCL) for different clinical endpoints. Across all the challenging scenarios, our method demonstrates stronger performance and robustness than unsupervised state-of-the-art (SOTA) methods. The code is released at https://github.com/MingyuShengSMY/GraphClusteringSIS.git.
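
The graph-partitioning step described in the abstract can be written compactly. What follows is a standard spectral-partitioning formulation, given as an assumption and not as a reproduction of the paper's own equations: the cosine-similarity affinity, the clipping at zero, and the symmetric normalization are illustrative choices, and $f_i$ denotes the extracted feature of graph node $i$.

$$
W_{ij} = \max\!\left(0,\ \frac{f_i^{\top} f_j}{\lVert f_i \rVert \, \lVert f_j \rVert}\right), \qquad
D_{ii} = \sum_{j} W_{ij}, \qquad
L = D^{-1/2}\,(D - W)\,D^{-1/2}
$$

$$
L\, y_k = \lambda_k\, y_k, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_N
$$

Under these assumptions, on a connected graph $y_1$ is proportional to $D^{1/2}\mathbf{1}$ and carries no partition information, while the second eigenvector $y_2$ (the Fiedler vector) is the relaxed solution of a two-way normalized cut. Thresholding $y_2$ therefore yields a binary mask, and clustering the rows of the stacked matrix $[y_1, \dots, y_g]$ yields a multi-region segmentation.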

Paper Structure (19 sections, 3 equations, 5 figures, 6 tables)

This paper contains 19 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of Our Method. Every surgical video frame is fed into a ViT-based feature extractor to generate high-level dense features. An affinity matrix $W$ is then computed, and its Laplacian matrix $L$ is eigendecomposed. The resulting eigenvectors provide distinct features that distinguish different modules in a frame: the first $g$ eigenvectors are stacked together for clustering, and the second eigenvector (the Fiedler vector) is used for saliency detection.
  • Figure 2: Sample Result of Eigendecomposition. The top-left is the original image frame, and "$i$ th" represents a visualized eigenvector with the $i$-th smallest eigenvalue.
  • Figure 3: Eigenvectors Visualization. Each row demonstrates the input image and its corresponding eigenvectors. (a) - (d) are from ARTNetDataset, EndoVis2017, EndoVis2018, and UCL respectively. "$i$ th" indicates the eigenvector with the $i$-th smallest eigenvalue. The Fiedler vector is denoted by "2nd".
  • Figure 4: Salient Detection Visualization. Each row illustrates an input image (Image), its binary ground-truth (GT), and the corresponding saliency map generated from the Fiedler vector. (a) and (b) are from ARTNetDataset; (c), (d), and (e) are from EndoVis2017, EndoVis2018 and UCL, respectively.
  • Figure 5: Binary Segmentation Visualization. Each column shows the input image (Image), the binary ground-truth (GT), and the prediction masks of AGSD and our CLU method. (a) and (b) are from EndoVis2017, (c) and (d) are from EndoVis2018, and (e) and (f) are from UCL; (a) shows a failure case of our CLU method.
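
The overview in Figure 1 (features, affinity matrix $W$, Laplacian $L$, eigendecomposition, then clustering or thresholding) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions and not the released implementation: the ViT feature extractor is replaced by synthetic toy features, the affinity is assumed to be clipped cosine similarity, the Laplacian is assumed to be the symmetric normalized one, and the function names `spectral_segment` and `kmeans` are hypothetical.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns one label per row."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct rows (fancy indexing copies).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def spectral_segment(features, g=2, n_clusters=2):
    """features: (N, D) array with one row per graph node."""
    # Assumption: W is cosine similarity with negative values clipped to 0.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    W = np.clip(f @ f.T, 0.0, None)
    np.fill_diagonal(W, 0.0)

    # Assumption: symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg + 1e-8)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order, so column 1 is the
    # eigenvector with the second-smallest eigenvalue (the Fiedler vector).
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]

    # Thresholding the Fiedler vector gives a binary (saliency-style) mask.
    # The eigenvector's sign is arbitrary, so which side is True is too.
    saliency = fiedler > fiedler.mean()

    # Clustering the first g stacked eigenvectors gives region labels.
    labels = kmeans(eigvecs[:, :g], n_clusters)
    return eigvals, saliency, labels


# Toy stand-in for dense features: two groups of 20 nodes in 8 dimensions.
rng = np.random.default_rng(0)
base_a = np.array([1.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
base_b = np.array([0.2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
toy = np.vstack([
    base_a + 0.05 * rng.normal(size=(20, 8)),
    base_b + 0.05 * rng.normal(size=(20, 8)),
])
eigvals, saliency, labels = spectral_segment(toy)
```

On real frames, `features` would hold the dense features of one frame flattened to one row per node, and the resulting mask or labels would be reshaped back to the feature-map grid. Building the dense $N \times N$ matrix is only practical for modest $N$; a sparse affinity with a partial eigensolver is a common alternative.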