Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding

David Gastager, Ghazal Ghazaei, Constantin Patsch

TL;DR

The paper tackles the challenge of scarce and heterogeneous annotated data for automated surgical workflow analysis by proposing a watch-and-learn paradigm that leverages expert commentary. It introduces a two-stage architecture: a VALOR-based stage-1 video-language model that learns short-range representations from a large YouTube cataract video dataset, and a stage-2 task-specific temporal model that captures long-range dependencies for phase segmentation; parameter-efficient LoRA fine-tuning enables cross-dataset adaptation. The approach achieves state-of-the-art or competitive results on phase segmentation across CATARACTS, Cataract-101, and Cholec80, with notable zero-shot generalization, and pioneers dense video captioning for surgical videos via a two-stage, long-range, video-language framework. The work demonstrates strong generalizability and enables new downstream tasks, offering a scalable path toward comprehensive automated surgical video understanding with practical educational and clinical impact.

Abstract

Automated surgical workflow analysis is crucial for education, research, and clinical decision-making, but the lack of annotated datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of annotated training data inspired by the human learning procedure of watching experts and understanding their explanations. Our method leverages a video-language model trained on alignment, denoising, and generative tasks to learn short-term spatio-temporal and multimodal representations. A task-specific temporal model is then used to capture relationships across entire videos. To achieve comprehensive video-language understanding in the surgical domain, we introduce a data collection and filtering strategy to construct a large-scale pretraining dataset from educational YouTube videos. We then utilize parameter-efficient fine-tuning by projecting downstream task annotations from publicly available surgical datasets into the language domain. Extensive experiments in two surgical domains demonstrate the effectiveness of our approach, with performance improvements of up to 7% in phase segmentation tasks, 8% in zero-shot phase segmentation, and comparable capabilities to fully-supervised models in few-shot settings. Harnessing our model's capabilities for long-range temporal localization and text generation, we present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing DVC datasets in the surgical domain. We introduce a novel approach to surgical workflow understanding that leverages video-language pretraining, large-scale video pretraining, and optimized fine-tuning. Our method improves performance over state-of-the-art techniques and enables new downstream tasks for surgical video understanding.
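The parameter-efficient fine-tuning mentioned above is named in the TL;DR as LoRA. As a rough illustration of the general idea (not the authors' configuration), the sketch below keeps a pretrained weight matrix frozen and adds a trainable low-rank update scaled by `alpha / rank`. The class name `LoRALinear`, the helper `matvec`, and the values chosen for `rank` and `alpha` are assumptions made for this example only.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of the standard LoRA formulation; it is not
    taken from the paper's code.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # A starts small and random, B starts at zero, so the low-rank
        # update is exactly zero before any fine-tuning step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Count only the adapter parameters in A and B."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)
```

With a 2x3 frozen weight and `rank=1`, only 5 adapter parameters would be trained instead of the 6 entries of the full matrix; the saving grows quickly for the much larger layers of a video-language model.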

Paper Structure

This paper contains 25 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the proposed two-stage architecture. Stage 1) A modified VALOR [Valor] video-language model processes clips of $N$ frames while maintaining short-range dependencies. It is pretrained on our large-scale video-language dataset and fine-tuned on target datasets using LoRA [Lora]. Stage 2) A temporal model [Tcn, Asformer, videomambasuite] captures global interactions over a whole video. $t_i$ refers to the frame timestamp and $c_i$ to the clip number.
  • Figure 2: Left: The first two PCA components of the stage 1 output before (left) and after (right) training on our YT-dataset. Right: Example phase predictions of different variants of our method. Bottom: Label legend. All examples on C101.
  • Figure 3: Example captions and phase predictions on two videos of the CATARACTS test set. (A) is from video 2, which includes a surgery complication (iris prolapse). (B) represents parts of video 8. The fully captioned videos are available in the supplementary materials.
  • Figure 4: (A) Predictions of our model variants on a difficult video ($\sim8$ min duration, expert surgeon) from the C101 test set. (B) Predictions of the V-YT-LoRA-CAT-TCN model trained with different subsets of the C101 training set on a difficult video ($15.5$ min duration, novice surgeon). (C) Zero-shot predictions of different stage 1 models on a medium-difficulty video of C101 ($\sim7.5$ min, novice surgeon). The zero-shot predictions lack temporal consistency because they were produced without a stage 2 model. The impact of LoRA is still clearly visible: the predictions capture the general trend of the phase segmentation.
  • Figure 5: Example predictions of V-CT50-ASF and V-CT50-TCN on a medium-difficulty example ((A): video_67.mp4) and the most difficult example ((B): video_58.mp4) from the Cholec80 test set. Difficulty is relative and was measured by the F1 scores the models achieved.
  • ...and 2 more figures
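
The two-stage layout described in the Figure 1 caption (frames $t_i$ grouped into clips $c_i$, per-clip short-range predictions, then a model over the whole video) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `stage1_predict` and `stage2_smooth` are hypothetical stand-ins that use majority votes where the paper uses a video-language model and a learned temporal model, and phases are represented as plain integers.

```python
from collections import Counter


def split_into_clips(num_frames, clip_len):
    """Group frame indices t_i into consecutive clips c_i of clip_len frames.

    The last clip may be shorter when num_frames is not a multiple of clip_len.
    """
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


def stage1_predict(clip, frame_guesses):
    """Stand-in for stage 1: a phase guess that only sees frames inside one clip.

    Assumption: a majority vote over noisy per-frame guesses replaces the
    video-language model.
    """
    return Counter(frame_guesses[t] for t in clip).most_common(1)[0][0]


def stage2_smooth(clip_preds, window=3):
    """Stand-in for stage 2: enforce temporal consistency across clips.

    Assumption: a sliding-window majority vote replaces the learned temporal
    model that looks at the whole video.
    """
    half = window // 2
    smoothed = []
    for i in range(len(clip_preds)):
        neighbourhood = clip_preds[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighbourhood).most_common(1)[0][0])
    return smoothed
```

For instance, `stage2_smooth([0, 0, 1, 0, 0, 1, 1, 1])` returns `[0, 0, 0, 0, 0, 1, 1, 1]`: the isolated phase-1 prediction at the third clip is removed. A per-clip model alone cannot correct flicker of this kind, which matches the Figure 4 remark that predictions made without a stage 2 model lack temporal consistency.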