Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence

Anita Rau, Mark Endo, Josiah Aklilu, Jaewoo Heo, Khaled Saab, Alberto Paderno, Jeffrey Jopling, F. Christopher Holsinger, Serena Yeung-Levy

TL;DR

This paper systematically evaluates 11 large vision-language models across 17 surgical visual tasks using 13 datasets to assess generalization, temporal/spatial reasoning, and adaptability. It demonstrates that while zero-shot capabilities exist, substantial gaps persist in precise localization and dynamic reasoning, with in-context learning providing notable performance boosts (up to three-fold) and reducing the gap to task-specific SOTA in some settings. Domain-specific models like SurgVLP and medical-tuned variants outperform generalist models on domain-relevant tasks, though open, contrastive models often underperform on fine-grained surgical cues. The study highlights near-term opportunities for VLMs in workflow optimization and operative-note generation, while underscoring the need for richer surgical data and improved reasoning to enable robust clinical deployment.

Abstract

Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains--especially surgery, where decision-making is subjective and clinical scenarios are variable--remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI--from anatomy recognition to skill assessment--using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.

Paper Structure

This paper contains 7 sections, 6 figures, 43 tables.

Figures (6)

  • Figure 2: We evaluate 11 VLMs on 38 task instances. Larger values indicate better performance for all metrics. Bar plots compare average VLM performance to state-of-the-art supervised models (SOTA). SOTA values are averaged over official results where available (denoted by * and detailed in the Supplement). Med-Gemini results are sparse due to licensing restrictions. Task comparisons vary: Evaluating Gestures, Skill, and Errors are video-based tasks (denoted by $^v$) and require temporal understanding; detection and segmentation also require specialized spatial localization capabilities. Additional metrics in Supplement. A) Surgical scene comprehension: VLMs recognize surgical objects but struggle with localization; few support detection or segmentation. To aid contextualization, we compare segmentation foundation models (SAM2/MedSAM), which generalize without training but are not VLMs. B) Surgical progression understanding: GPT-4o excels in procedural understanding, including action and phase recognition, with the open-source SurgVLP as a strong alternative. Gesture recognition in videos remains unsolved. C) Surgical safety & performance assessment: Open-source contrastive models outperform proprietary ones in risk/safety assessment, and video tasks remain challenging.
  • Figure 3: Qualitative zero-shot examples for various tasks, models, and datasets. Correct predictions are shown in bold. Prompts shortened for display; full versions in the Supplement. A) Gemini impresses in hand and tool detection, even spotting unannotated tools and hands. PaliGemma repeatedly predicts the same objects. B) GPT-4o leads at action prediction. C) Generalist models can infer surgical knowledge from general knowledge, in this example tying the term "gallbladder packaging" to the easily discernible plastic bag in the shown example. D) CVS assessment is challenging for auto-regressive models. Contrastive models such as SurgVLP and CLIP can more accurately discriminate subtle visual cues in this task. E) GPT-4o outperforms all models at error classification. The error can be identified based on the thermal injury on the liver which we marked here with a green arrow, but the injury is missed by all models except GPT.
  • Figure 4: Comparing VLMs (colored bars) with task-specific SOTA models that are evaluated out-of-domain (dark gray) highlights the generalization capabilities of VLMs. In this experiment SOTA models were evaluated on a different dataset than they had been trained on, but still performed the same task as during training. In this out-of-domain setting, VLMs perform competitively with SOTA models and even surpass them in zero-shot tool recognition. We also compare 5-shot results for models that have in-context capabilities (GPT and Gemini). Performance is reported using F1 scores for CVS assessment, phase recognition, and tool recognition, while anatomy detection is evaluated using mAP@.5:.95. For reference, we also include SOTA in-domain results, but note that these are not directly comparable to VLM results.
  • Figure 5: In-context learning with 1, 3, or 5 examples per class can significantly improve model performance over the zero-shot setting. Task-specific SOTA model results are provided for context where available. The score is F1 for all recognition tasks and mAP@.5:.95 for detection (Det) tasks.
  • Figure : A
  • ...and 1 more figure
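
The captions above report F1 for recognition tasks and mAP@.5:.95 for detection tasks. The sketch below illustrates the two building blocks behind those metrics: a per-class F1 averaged over classes, and the ten IoU thresholds (0.50 to 0.95 in steps of 0.05) over which mAP@.5:.95 is averaged. It is not the paper's evaluation code. The function names are hypothetical, macro-averaging of F1 is an assumption (the captions do not state the averaging mode), and the full average-precision computation is omitted.

```python
from typing import List, Sequence, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]

# IoU thresholds implied by the "@.5:.95" notation: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (macro-averaging is assumed here)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def thresholds_matched(pred: Box, truth: Box) -> int:
    """Number of IoU thresholds at which a predicted box counts as a match."""
    overlap = iou(pred, truth)
    return sum(1 for t in IOU_THRESHOLDS if overlap >= t)
```

A detection that overlaps the ground truth only loosely passes the 0.50 threshold but fails the stricter ones, which is why mAP@.5:.95 penalizes imprecise localization more heavily than a single-threshold score such as mAP@.5.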