Surgical Video Understanding with Label Interpolation

Garam Kim, Tae Kyeong Jeong, Juyoun Park

Abstract

Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. Optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
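
As a rough illustration of the label propagation idea described in the abstract, the sketch below warps a key-frame segmentation mask onto a neighboring unlabeled frame using a dense optical flow field. It is a generic backward-warping example and not the authors' implementation: the function name `warp_labels`, the `IGNORE_LABEL` value, and the assumption that the flow is already estimated and points from the unlabeled frame back to the key frame are all hypothetical choices made for this example.

```python
import numpy as np

# Hypothetical class id for pixels whose source falls outside the key frame.
IGNORE_LABEL = 255


def warp_labels(key_mask, backward_flow, ignore_label=IGNORE_LABEL):
    """Propagate a key-frame label mask to an adjacent frame.

    key_mask:      (H, W) integer array of class ids on the annotated key frame.
    backward_flow: (H, W, 2) array; for each pixel (y, x) of the unlabeled
                   frame, (dx, dy) points to its location in the key frame.
                   (Assumed to be precomputed by some flow estimator.)
    """
    h, w = key_mask.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Nearest-neighbor sampling keeps labels discrete (no blended class ids).
    src_x = np.rint(xs + backward_flow[..., 0]).astype(np.int64)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(np.int64)

    # Pixels that map outside the key frame have no label to copy.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    warped = np.full((h, w), ignore_label, dtype=key_mask.dtype)
    warped[valid] = key_mask[src_y[valid], src_x[valid]]
    return warped


# Toy usage: an "instrument" column (class 1) moves one pixel to the right,
# so every pixel in the new frame looks one pixel to the left in the key frame.
key_mask = np.zeros((4, 4), dtype=np.uint8)
key_mask[:, 1] = 1
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[..., 0] = -1.0
pseudo_label = warp_labels(key_mask, flow)
```

Nearest-neighbor lookup is used here because interpolating class ids would produce meaningless intermediate labels; pixels with no valid source are marked with the ignore value so a training loss could skip them.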

Paper Structure

This paper contains 16 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Temporal–spatial annotation imbalance in medical datasets. Illustration of the imbalance between temporal annotations (phase, step, and step anticipation, available for every frame) and spatial annotations (instrument segmentation and action detection, annotated only on key frames).
  • Figure 2: Overview of the proposed SurgMINT Framework. Segmentation labels are interpolated to support robust multi-task surgical video understanding, covering phase/step recognition, step anticipation, and instrument/action detection.
  • Figure 3: Framework for segmentation label interpolation using optical flow, corresponding to the warping branch in Fig. 2.
  • Figure 4: Results of each branch during the label interpolation process. Left: RGB image; middle: predicted mask; right: overlay.
  • Figure 5: Training process of SurgMINT. (a) After training the instrument segmentation model, (b) all tasks—including phase/step recognition, step anticipation, and instrument/action detection—are fine-tuned together based on the trained segmentation model.
  • ...and 4 more figures