VidLPRO: A $\underline{Vid}$eo-$\underline{L}$anguage $\underline{P}$re-training Framework for $\underline{Ro}$botic and Laparoscopic Surgery

Mohammadmahdi Honarmand; Muhammad Abdullah Jamal; Omid Mohareri

TL;DR

VidLPRO introduces a multi-objective video-language pre-training framework tailored for robotic and laparoscopic surgery, combining Video-Text Contrastive Learning, Video-Text Matching, and Masked Language Modeling to capture temporal dynamics and cross-modal alignment. The GenSurg+ dataset, built from GenSurgery with audio-filtered videos, Whisper transcripts, and GPT-4 captions, provides a large-scale, high-quality foundation for pretraining. Empirical results show state-of-the-art zero-shot surgical phase recognition on Cholec80 and AutoLaparo, with strong robustness to frame sampling and clear benefits from temporal context. Together, VidLPRO and GenSurg+ establish a scalable, domain-specific foundation model for surgical video understanding with practical implications for training, guidance, and decision-support in real-world procedures.

Abstract

We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery. While existing surgical VL models primarily rely on contrastive learning, we propose a more comprehensive approach to capture the intricate temporal dynamics and align video with language. VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations. To support this framework, we present GenSurg+, a carefully curated dataset derived from GenSurgery, comprising 17k surgical video clips paired with captions generated by GPT-4 using transcripts extracted by the Whisper model. This dataset addresses the need for large-scale, high-quality VL data in the surgical domain. Extensive experiments on benchmark datasets, including Cholec80 and AutoLaparo, demonstrate the efficacy of our approach. VidLPRO achieves state-of-the-art performance in zero-shot surgical phase recognition, significantly outperforming existing surgical VL models such as SurgVLP and HecVL. Our model demonstrates improvements of up to 21.5% in accuracy and 15.7% in F1 score, setting a new benchmark in the field. Notably, VidLPRO exhibits robust performance even with single-frame inference, while effectively scaling with increased temporal context. Ablation studies reveal the impact of frame sampling strategies on model performance and computational efficiency. These results underscore VidLPRO's potential as a foundation model for surgical video understanding.
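To make the video-text contrastive objective named in the abstract more concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `log_softmax`, `vtc_loss`), the temperature value of 0.07, and the use of plain Python lists in place of encoder outputs are all hypothetical choices made here for readability.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def vtc_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch where video i is paired with text i.

    The temperature default is an assumed, commonly used value,
    not one taken from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j]: temperature-scaled cosine similarity of video i and text j.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: for each video (row), the matching text is the positive.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-video: for each text (column), the matching video is the positive.
    t2v = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n

    return 0.5 * (v2t + t2v)


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
aligned = vtc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = vtc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a much larger one, which is the behavior a contrastive objective is meant to reward. The video-text matching and masked language modeling objectives would be added on top of a term like this; their exact form and weighting are not specified in this section.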

Paper Structure (33 sections, 12 equations, 3 figures, 6 tables)

Introduction
Related Work
Vision-Language models
Surgical Video-Language Pretraining
Surgical Phase Recognition
Method
GenSurg+
Dataset Creation Pipeline
Audio Filtering.
Transcript Extraction.
Video Segmentation and Filtering.
Caption Generation.
Dataset Statistics and Characteristics
VidLPRO
Model Architecture
...and 18 more sections

Figures (3)

Figure 1: Current approaches (left) rely on a video-text contrastive loss only, while our method (right) additionally employs a video-text matching loss and masked language modeling to enhance cross-modal fusion and surgical language understanding.
Figure 2: Overview of the GenSurg+ dataset creation pipeline.
Figure 3: Overview of the VidLPRO model architecture and configuration. The model employs a Vision Transformer (ViT) as the video encoder and BERT as the text encoder. The multimodal fusion module integrates visual and textual representations, while pre-training objectives such as Video-Text Contrastive Learning (VTC), Video-Text Matching (VTM), and Masked Language Modeling (MLM) ensure comprehensive learning of multimodal representations.
