Robust Surgical Phase Recognition From Annotation Efficient Supervision

Or Rubin, Shlomi Laufer

TL;DR

This work proposes a robust method for surgical phase recognition that can handle missing phase annotations effectively and introduces the SkipTag@K annotation approach to the surgical domain, enabling a flexible balance between annotation effort and model performance.

Abstract

Surgical phase recognition is a key task in computer-assisted surgery, aiming to automatically identify and categorize the different phases of a surgical procedure. Despite substantial advancements, most current approaches rely on fully supervised training, which requires expensive and time-consuming frame-level annotations. Timestamp supervision has recently emerged as a promising alternative, significantly reducing annotation costs while maintaining competitive performance. However, models trained on timestamp annotations can be negatively impacted by missing phase annotations, which is a potential drawback in real-world scenarios. In this work, we address this issue by proposing a robust method for surgical phase recognition that handles missing phase annotations effectively. Furthermore, we introduce the SkipTag@K annotation approach to the surgical domain, enabling a flexible balance between annotation effort and model performance. Our method achieves competitive results on two challenging datasets, demonstrating its efficacy in handling missing phase annotations and its potential for reducing annotation costs. Specifically, we achieve an accuracy of 85.1% on the MultiBypass140 dataset using only 3 annotated frames per video, showcasing the effectiveness of our method and the potential of the SkipTag@K setup. We perform extensive experiments to validate the robustness of our method and provide insights to guide future research in surgical phase recognition. Our work contributes to the advancement of surgical workflow recognition and paves the way for more efficient and reliable surgical phase recognition systems.
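To make the difference between the two annotation budgets concrete, here is a minimal sketch contrasting timestamp supervision (one annotated frame per phase segment) with SkipTag@K (K annotated frames per video). It is an illustration under stated assumptions, not the authors' protocol: picking the middle frame of each segment for timestamps and evenly spaced positions for SkipTag@K are assumptions, and the function names `timestamp_labels`, `skiptag_labels`, and `label_distribution` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def timestamp_labels(frame_labels: List[int]) -> Dict[int, int]:
    """Timestamp supervision: one annotated frame per contiguous phase segment.

    Assumption: the middle frame of each segment is chosen; a real annotator
    may pick any frame inside the segment.
    """
    labels: Dict[int, int] = {}
    start = 0
    for t in range(1, len(frame_labels) + 1):
        # A segment ends at the last frame or where the phase label changes.
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            mid = (start + t - 1) // 2
            labels[mid] = frame_labels[mid]
            start = t
    return labels


def skiptag_labels(frame_labels: List[int], k: int) -> Dict[int, int]:
    """SkipTag@K: annotate K frames per video.

    Assumption: the K positions are evenly spaced and do not depend on
    where phases begin or end.
    """
    n = len(frame_labels)
    positions = [int((i + 0.5) * n / k) for i in range(k)]
    return {p: frame_labels[p] for p in positions}


def label_distribution(labels: Dict[int, int]) -> Dict[int, float]:
    """Fraction of annotated frames per phase (cf. the histograms in Figure 3)."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {phase: c / total for phase, c in sorted(counts.items())}


if __name__ == "__main__":
    video = [0] * 10 + [1] * 50 + [2] * 40  # 100 frames, 3 phases of unequal length
    print(timestamp_labels(video))   # {4: 0, 34: 1, 79: 2}
    print(skiptag_labels(video, 3))  # {16: 1, 50: 1, 83: 2}
```

Under these assumptions, timestamp labels cover every phase exactly once, so their distribution is uniform regardless of phase length, while SkipTag@K samples follow the phase-length distribution of the video but can miss a short phase (phase 0 in the example) when K is small.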

Paper Structure (23 sections, 15 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Overview of our proposed surgical phase recognition method. (a) The model prediction pipeline for generating phase predictions from an input surgical video. It consists of a two-stage architecture: a feature extraction model followed by a temporal model. (b) The feature extractor training pipeline. A ResNet-50 model pre-trained on ImageNet is fine-tuned using self-supervised learning on the target surgical video dataset. (c) The temporal model training pipeline. It includes an initial training stage using our proposed loss function to create a base model. The base model then generates pseudo-labels which are used to train the final temporal model.
  • Figure 2: Illustration of the uncertainty measure used to identify phase transition events with different temperature scaling values (T) on a Cholec80 video. The black vertical lines mark a window of 2W frames centered on each ground-truth transition event between surgical phases. Lower temperature values result in a more stable uncertainty measure, enabling more robust transition event detection than using no temperature scaling (T=1). The red horizontal line represents the uncertainty threshold.
  • Figure 3: Distribution of per-frame phase annotations in the Cholec80 dataset. The full distribution (yellow) reveals significant class imbalance, while the timestamp labels (pink) are more evenly distributed. SkipTag@K sampling with K=2, 4, and 7 (blue shades) effectively captures the original data distribution, with increasing similarity to the full distribution as K increases.
  • Figure 4: Robustness comparison between our method and Ding et al.'s approach under different probabilities of missing phase annotations on the Cholec80 and MultiBypass140 datasets. The six subfigures show accuracy, Jaccard index, and F1 score for both datasets. As the missing rate increases, Ding et al.'s method suffers a significant drop in performance across all metrics on both datasets, while our method maintains stable performance with only a slight decline.
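Figure 2 describes detecting phase transitions by thresholding a temperature-scaled uncertainty measure. The following is a minimal sketch that assumes the measure is the normalized entropy of the temperature-scaled softmax over per-frame phase logits, with a frame flagged when its uncertainty exceeds a fixed threshold. The paper's exact uncertainty function, temperature, threshold, and window size W are not given here, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; temperature < 1 sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy divided by log(C), so the value lies in [0, 1] for C >= 2 phases."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def uncertain_frames(frame_logits: Sequence[Sequence[float]],
                     temperature: float, threshold: float) -> List[int]:
    """Indices of frames whose uncertainty exceeds the threshold (candidate transitions)."""
    return [t for t, logits in enumerate(frame_logits)
            if normalized_entropy(softmax(logits, temperature)) > threshold]


def near_transition(frames: Sequence[int], gt_transitions: Sequence[int],
                    w: int) -> List[bool]:
    """True for each frame lying within W frames of some ground-truth transition."""
    return [any(abs(t - g) <= w for g in gt_transitions) for t in frames]


if __name__ == "__main__":
    # Five frames moving from phase 0 to phase 1; frame 2 is ambiguous.
    logits = [[4.0, 0.0], [3.0, 0.5], [1.0, 0.9], [0.5, 3.0], [0.0, 4.0]]
    print(uncertain_frames(logits, temperature=1.0, threshold=0.3))  # [1, 2, 3]
    print(uncertain_frames(logits, temperature=0.5, threshold=0.3))  # [2]
```

In this toy example, T=1 also flags the two moderately confident frames around the ambiguous one. T=0.5 sharpens those predictions so that only the ambiguous frame crosses the threshold, which illustrates the stabilizing effect the caption attributes to lower temperatures.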
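Figure 4 evaluates robustness as the probability of missing phase annotations increases. One plausible way to simulate this is to drop each timestamp annotation independently with probability p before training. This is an assumption, not a procedure taken from the paper, and the helper `drop_phase_annotations` is hypothetical.

```python
import random
from typing import Dict


def drop_phase_annotations(timestamps: Dict[int, int], missing_rate: float,
                           seed: int = 0) -> Dict[int, int]:
    """Remove each timestamp annotation independently with probability missing_rate.

    timestamps maps frame index -> phase label (one entry per annotated phase segment).
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so each simulated setting is reproducible
    # rng.random() is in [0, 1), so rate 0.0 keeps everything and rate 1.0 drops everything.
    return {frame: phase for frame, phase in sorted(timestamps.items())
            if rng.random() >= missing_rate}


if __name__ == "__main__":
    stamps = {4: 0, 34: 1, 79: 2, 120: 3, 160: 4, 200: 5, 240: 6}
    for rate in (0.0, 0.3, 0.6):
        kept = drop_phase_annotations(stamps, rate, seed=1)
        print(f"missing_rate={rate}: kept phases {sorted(kept.values())}")
```

Seeding the generator per setting makes the comparison between methods fair, because both would be trained on exactly the same reduced annotation set.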
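Figure 1(c) describes a two-step temporal training scheme in which a base model generates pseudo-labels that are then used to train the final temporal model. The caption does not say how predictions are turned into labels. This sketch therefore assumes a common recipe: take the argmax phase for each frame, keep the annotated frames fixed, and mark low-confidence frames so the loss ignores them. The function name `make_pseudo_labels` and the confidence rule are hypothetical.

```python
from typing import Dict, List, Sequence

IGNORE = -100  # matches the default ignore_index of torch.nn.CrossEntropyLoss


def make_pseudo_labels(frame_probs: Sequence[Sequence[float]],
                       annotated: Dict[int, int],
                       min_confidence: float = 0.0) -> List[int]:
    """Convert base-model per-frame phase probabilities into pseudo-labels.

    Annotated frames keep their human label; every other frame takes the
    argmax phase, or IGNORE when the top probability is below min_confidence.
    """
    labels: List[int] = []
    for t, probs in enumerate(frame_probs):
        if t in annotated:
            labels.append(annotated[t])
            continue
        best = max(range(len(probs)), key=lambda c: probs[c])
        labels.append(best if probs[best] >= min_confidence else IGNORE)
    return labels


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    print(make_pseudo_labels(probs, annotated={1: 1}, min_confidence=0.85))  # [0, 1, -100]
```

Setting `min_confidence` to 0.0 labels every frame with the base model's prediction. Raising it trades label coverage for label precision.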