Motion-Boundary-Driven Unsupervised Surgical Instrument Segmentation in Low-Quality Optical Flow

Yang Liu; Peiran Wu; Jiayu Huo; Gongyu Zhang; Zhen Yuan; Christos Bergeles; Rachel Sparks; Prokar Dasgupta; Alejandro Granados; Sebastien Ourselin

TL;DR

The paper addresses unsupervised surgical instrument segmentation in endoscopic videos, where low-quality optical flow weakens motion-based supervision. It introduces a motion-boundary-driven framework with three components: High-Quality Area Matching (HQAM), which restricts supervision to reliable motion boundaries; Low-Quality Cases Dropping (LQCD), which discards frames whose flow is globally poor; and a variable frame-rate training scheme that captures subtle instrument motion. The framework builds on the RCF segmentation model, supervised by pseudo flow maps from a pre-trained RAFT motion estimator. Together these components reach mIoU scores of about 0.75 and 0.72 on the EndoVis 2017 VOS and Challenge datasets, respectively, outperforming prior unsupervised methods and improving substantially over the RCF baseline. The plug-and-play design reduces dependence on manual annotations and may extend to other motion-driven tasks such as unsupervised depth estimation in clinical settings.
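The variable frame-rate idea can be illustrated with a small sampling routine: instead of always pairing adjacent frames, each training pair is separated by a random interval, so slow instrument motion still produces a visible displacement. This is only a sketch under stated assumptions. The function name `sample_frame_pair`, the `max_interval` parameter, and the uniform sampling of the interval are hypothetical; the summary does not say how the interval is actually drawn.

```python
import random


def sample_frame_pair(num_frames, max_interval, rng=random):
    """Pick frame indices (t, t + r) with a random interval r.

    Assumption: r is drawn uniformly from [1, max_interval]; the
    distribution used in the paper is not given in this summary.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if max_interval < 1:
        raise ValueError("max_interval must be at least 1")
    # Clamp the interval so the second index always stays inside the clip.
    r = rng.randint(1, min(max_interval, num_frames - 1))
    t = rng.randint(0, num_frames - 1 - r)
    return t, t + r, r


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(3):
        print(sample_frame_pair(num_frames=50, max_interval=5, rng=rng))
```

Larger intervals exaggerate slow motion but also make flow estimation harder, so in practice the upper bound would need tuning per dataset.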

Abstract

Unsupervised video-based surgical instrument segmentation has the potential to accelerate the adoption of robot-assisted procedures by reducing the reliance on manual annotations. However, the generally low quality of optical flow in endoscopic footage poses a great challenge for unsupervised methods that rely heavily on motion cues. To overcome this limitation, we propose a novel approach that pinpoints motion boundaries, regions with abrupt flow changes, while selectively discarding frames with globally low-quality flow and adapting to varying motion patterns. Experiments on the EndoVis2017 VOS and EndoVis2017 Challenge datasets show that our method achieves mean Intersection-over-Union (mIoU) scores of 0.75 and 0.72, respectively, effectively alleviating the constraints imposed by suboptimal optical flow. This enables a more scalable and robust surgical instrument segmentation solution in clinical settings. The code will be publicly released.
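For readers unfamiliar with the metric, the following is a minimal sketch of how a mean Intersection-over-Union score can be computed for binary instrument masks. It assumes per-frame averaging and treats two empty masks as a perfect match; neither convention is confirmed by the abstract, and the function names are illustrative.

```python
def binary_iou(pred, target):
    """IoU between two binary masks given as equally shaped 2D lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, g in zip(pred_row, target_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-frame IoU over a sequence of mask pairs."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")
    ious = [binary_iou(p, g) for p, g in zip(preds, targets)]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(mean_iou(preds, targets))  # frame IoUs are 0.5 and 0.0
```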

Paper Structure (11 sections, 4 equations, 4 figures, 3 tables)

Introduction
Method
High-Quality Area Matching
Low-Quality Cases Drop
Variable Frame Rates Training Input
Experiments
Datasets and Evaluation Metrics
Implementation Details
Comparison with State-of-the-art Methods
Ablation Study
Conclusion

Figures (4)

Figure 1: Examples of low-quality optical flow frames, including stationary instruments, dark areas, and abrupt movements, all of which greatly limit model performance.
Figure 2: Overview of our proposed unsupervised instrument segmentation framework. Two frames, separated by a random interval $r$, are fed into both a motion-guided segmentation model (e.g., RCF; Lian et al., CVPR 2023) and a pre-trained motion estimator (e.g., RAFT; Teed and Deng, 2020) that generates pseudo flow maps $o_t$. The proposed HQAM and LQCD modules refine these pseudo flow maps, yielding more robust supervision.
Figure 3: Illustration of our HQAM and LQCD modules. HQAM derives a boundary-based mask from the pseudo optical flow $o_t$, isolating reliable high-quality regions to guide segmentation. Meanwhile, LQCD ranks the frames in a batch by per-frame loss and discards the top $h$ "hard cases", removing globally low-quality motion signals.
Figure 4: Qualitative comparison with the baseline model RCF: (a) optical flow pseudo-labels predicted by RAFT, (b) ground truth from EndoVis 2017, (c) prediction masks of our method, and (d) prediction masks of RCF.
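The HQAM and LQCD captions above describe two filtering steps that can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation. Motion boundaries are approximated here by thresholding the flow-gradient magnitude at a quantile, the matching loss is taken to be a masked L1 error, and the names `motion_boundary_mask`, `masked_l1`, `drop_hard_cases`, and the `quantile` parameter are all hypothetical.

```python
import numpy as np


def motion_boundary_mask(flow, quantile=0.9):
    """Boolean (H, W) mask of pixels where the flow changes abruptly.

    flow: float array of shape (H, W, 2) holding (dx, dy) per pixel.
    quantile: assumed threshold on flow-gradient magnitude.
    """
    # Spatial derivatives of both flow channels along y (axis 0) and x (axis 1).
    d_y, d_x = np.gradient(flow, axis=(0, 1))
    grad_mag = np.sqrt((d_x ** 2 + d_y ** 2).sum(axis=-1))
    thr = np.quantile(grad_mag, quantile)
    # Excluding zero gradients keeps a flat flow field from being selected.
    return (grad_mag >= thr) & (grad_mag > 0)


def masked_l1(pred_flow, pseudo_flow, mask):
    """Mean absolute flow error restricted to masked pixels (assumed loss)."""
    if not mask.any():
        return 0.0
    return float(np.abs(pred_flow - pseudo_flow)[mask].mean())


def drop_hard_cases(per_frame_loss, h):
    """Mean loss after discarding the h largest per-frame losses.

    Returns the mean and the sorted indices of the frames that were kept.
    """
    losses = np.asarray(per_frame_loss, dtype=float)
    if not 0 <= h < losses.size:
        raise ValueError("h must satisfy 0 <= h < batch size")
    keep = np.argsort(losses)[: losses.size - h]
    return float(losses[keep].mean()), np.sort(keep)


if __name__ == "__main__":
    flow = np.zeros((8, 8, 2))
    flow[:, 4:, 0] = 1.0  # right half moves, so a boundary sits mid-image
    mask = motion_boundary_mask(flow)
    print(mask.sum(), masked_l1(np.zeros_like(flow), flow, mask))
    print(drop_hard_cases([0.2, 5.0, 0.4, 0.3], h=1))
```

In a training loop, the per-frame values passed to `drop_hard_cases` would be the masked losses of each frame in the batch, so that a frame whose flow is poor everywhere contributes no gradient.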
