Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence
Anita Rau, Mark Endo, Josiah Aklilu, Jaewoo Heo, Khaled Saab, Alberto Paderno, Jeffrey Jopling, F. Christopher Holsinger, Serena Yeung-Levy
TL;DR
This paper systematically evaluates 11 large vision-language models across 17 surgical visual tasks using 13 datasets to assess generalization, temporal/spatial reasoning, and adaptability. It demonstrates that while zero-shot capabilities exist, substantial gaps persist in precise localization and dynamic reasoning, with in-context learning providing notable performance boosts (up to three-fold) and reducing the gap to task-specific SOTA in some settings. Domain-specific models like SurgVLP and medical-tuned variants outperform generalist models on domain-relevant tasks, though open, contrastive models often underperform on fine-grained surgical cues. The study highlights near-term opportunities for VLMs in workflow optimization and operative-note generation, while underscoring the need for richer surgical data and improved reasoning to enable robust clinical deployment.
Abstract
Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains--especially surgery, where decision-making is subjective and clinical scenarios are variable--remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI--from anatomy recognition to skill assessment--using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.
