Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

Kun Yuan; Vinkle Srivastav; Tong Yu; Joel L. Lavanchy; Jacques Marescaux; Pietro Mascagni; Nassir Navab; Nicolas Padoy

TL;DR

SurgVLP addresses the lack of scalable, annotation-light supervision in surgical computer vision by learning a joint vision-language representation from open surgical video lectures, each transcribed by two ASR systems (AWS and Whisper). It uses a dual-branch architecture, with a ResNet-50 visual encoder and a BioClinicalBert text encoder, trained with a combined InfoNCE and MIL-NCE objective that aligns video clips with the two text views. The paper introduces the Surgical Video Lecture (SVL) pretraining dataset and shows that the learned representations transfer zero-shot to vision-and-language tasks (retrieval, grounding, captioning) and to vision-only tasks (tool, phase, and triplet recognition) across multiple datasets, aided by contextual prompts and careful text encoding. The authors position SurgVLP as a scalable foundation for surgical workflow analysis and acknowledge limitations in fine-grained anatomical reasoning and in the ASR domain gap, which they leave to future refinement and domain adaptation.
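The combined objective can be illustrated with a minimal sketch. According to the Figure 2 caption, InfoNCE is applied to the AWS sentences and MIL-NCE to the Whisper sentences. Everything else here is an assumption made for illustration: the function names, the temperature value, the equal weighting of the two terms, and the use of only the video-to-text direction in the MIL-NCE term. It is not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def logsumexp(xs):
    m = max(xs)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in xs))

def info_nce(video, text, temperature=0.1):
    """Symmetric InfoNCE: video[i] and text[i] are a positive pair,
    all other pairings in the batch act as negatives."""
    v = [normalize(x) for x in video]
    t = [normalize(x) for x in text]
    n = len(v)
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        loss += logsumexp(sim[i]) - sim[i][i]                         # video -> text
        loss += logsumexp([sim[j][i] for j in range(n)]) - sim[i][i]  # text -> video
    return loss / (2 * n)

def mil_nce(video, text_bags, temperature=0.1):
    """Simplified MIL-NCE (video -> text only): text_bags[i] holds several
    candidate sentences for clip i, and all of them count as positives."""
    v = [normalize(x) for x in video]
    bags = [[normalize(s) for s in bag] for bag in text_bags]
    loss = 0.0
    for i in range(len(v)):
        pos = [dot(v[i], s) / temperature for s in bags[i]]
        everything = [dot(v[i], s) / temperature for bag in bags for s in bag]
        loss += logsumexp(everything) - logsumexp(pos)
    return loss / len(v)

def combined_loss(video, aws_text, whisper_bags):
    # Equal weighting of the two terms is an assumption of this sketch.
    return info_nce(video, aws_text) + mil_nce(video, whisper_bags)

# Toy 2-d embeddings standing in for encoder outputs (made-up numbers).
video = [[1.0, 0.0], [0.0, 1.0]]
aws_text = [[1.0, 0.0], [0.0, 1.0]]
whisper_bags = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
aligned = combined_loss(video, aws_text, whisper_bags)
shuffled = combined_loss(video, aws_text[::-1], whisper_bags[::-1])
```

With these toy inputs, the correctly paired batch yields a lower loss than the batch whose text is swapped between clips, which is the behaviour a contrastive objective is meant to reward.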

Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. The [training code](https://github.com/CAMMA-public/PeskaVLP) and [weights](https://github.com/CAMMA-public/SurgVLP) are public.

Paper Structure (29 sections, 5 equations, 10 figures, 11 tables)

Introduction
Related Works
Advances in representation learning methods
Downstream tasks
Approach
Video clip-text pairs
Dual-branch model
Visual encoder
Text encoder
Multiple text-views contrastive supervision
Downstream tasks and SurgVLP's adaptation
Vision-and-language surgical tasks
Vision-only surgical tasks
Experiments
Implementation details
...and 14 more sections

Figures (10)

Figure 1: Examples of video clip-text pairs from the SVL dataset. Each pair consists of a video clip and its corresponding transcript. We generate transcripts for hundreds of surgical video lectures using two ASR systems, i.e., AWS Medical Transcribe and Whisper (Radford et al., 2022). The transcripts usually describe the essential concepts of surgical anatomies, instruments, and events. We use large-scale video clip-text pairs to learn joint multi-modal representations.
Figure 2: Pipeline of the proposed SurgVLP. Panel (a) shows examples of video clip-text pairs and their construction process. We have two text views and pair them with video clips of random length. Panel (b) presents the contrastive learning objective with AWS sentences and Whisper sentences. SurgVLP uses the InfoNCE and MIL-NCE losses for AWS and Whisper sentences, respectively. Panel (c) illustrates how downstream tasks are performed in the zero-shot setting. The vision-and-language tasks, e.g., text-based video retrieval and temporal activity grounding, are shown at the top, and the vision-only tasks at the bottom.
Figure 3: Text-only training for video captioning. We use the learned joint embedding space, in which a text is encoded close to the representations of its corresponding video clips. During training, the text decoder learns to generate captions from text embeddings. During inference, video clips are fed to the visual encoder, and the resulting visual embeddings are passed to the text decoder to generate the captions.
Figure 4: Qualitative results of text-based video retrieval on the SVL-Retrieval dataset using SurgVLP's learned joint multi-modal representations. For each language query, we retrieve $3$ video clips from the repository. The ground truth video clip is framed in green. In these examples, it always appears among the top-3 results.
Figure 5: Textual-visual activation maps from different sentence queries. The first row shows the ground truth. The second row shows the predicted activation map along the time axis for the raw sentence. The third row shows the newly generated activation maps conditioned on modified sentences. When the whole sentence is decomposed into sub-sentences, SurgVLP generates a more focused textual-visual activation map for the sub-sentence with clearer, less ambiguous words. This shows that SurgVLP responds to specific surgical terms rather than general terminology.
...and 5 more figures
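
The zero-shot use of the joint embedding space described in the Figure 2(c) and Figure 4 captions comes down to ranking candidates by similarity to a query embedding. The sketch below is a minimal illustration under stated assumptions: the embeddings are made-up toy vectors rather than outputs of the SurgVLP encoders, the class prompts and clip names are hypothetical, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates, k=3):
    """Return the names of the k candidates closest to the query embedding."""
    scored = sorted(candidates.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Vision-only task: one text prompt per class, embedded by the text encoder.
# Prompts and vectors are hypothetical placeholders.
prompt_embeddings = {
    "the surgeon uses a grasper": [1.0, 0.0, 0.0],
    "the surgeon uses a hook": [0.0, 1.0, 0.1],
    "the surgeon uses a clipper": [0.0, 0.1, 1.0],
}
frame_embedding = [0.1, 0.95, 0.05]  # stands in for a visual encoder output
predicted_prompt = rank_by_similarity(frame_embedding, prompt_embeddings, k=1)[0]

# Vision-and-language task: a text query retrieves the top-3 clips.
clip_embeddings = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.2, 0.8, 0.1],
    "clip_c": [0.0, 0.2, 0.9],
    "clip_d": [0.5, 0.5, 0.5],
}
query_embedding = [0.0, 0.1, 1.0]  # stands in for a text encoder output
top3_clips = rank_by_similarity(query_embedding, clip_embeddings, k=3)
```

The same ranking function serves both settings: for classification the candidates are class prompts and the query is an image embedding, while for retrieval the candidates are video clips and the query is a sentence embedding.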
