A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery spans disparate tasks -- including multimodal data integration, human interaction, and physical effects -- generally capable AI models would be particularly attractive as collaborative tools if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether, and to what extent, modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short on the seemingly simple task of tool detection in neurosurgery. Additionally, we present scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be ``scaled away'' with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

Paper Structure

This paper contains 52 sections, 9 figures, and 37 tables.

Figures (9)

  • Figure 1: Example frames from SDSC-EEA with zero-shot predictions from Gemma 3 27B. Top row: correct detections (left to right: Drill + Suction; Suction; Drill + Suction; no tools; no tools). Bottom row: incorrect detections (left to right): (1) $y$ = Drill, Suction vs. $\hat{y}$ = Curette, Grasper, Irrigation, Monopolar Electrocautery, Suction; (2) $y$ = Cotton Patty, Rhoton Dissector, Suction vs. $\hat{y}$ = Grasper, Monopolar Electrocautery, Suction; (3) $y$ = Bipolar Forceps, Suction vs. $\hat{y}$ = Curette, Drill, Suction, Tissue Shaver; (4) $y$ = Suction vs. $\hat{y}$ = Grasper, Monopolar Electrocautery, Suction; (5) $y$ = Rhoton Dissector vs. $\hat{y}$ = Monopolar Electrocautery, Suction.
  • Figure 2: Exact-match accuracy on the SDSC-EEA validation set ($n=20{,}016$) as a function of model parameter count. Colors and marker shapes denote model families. The black dashed line indicates the majority-class baseline (13.4%). Accuracy exhibits a positive but strongly sublinear relationship with parameter count; the relationship is family-dependent, with Qwen models consistently outperforming similarly-sized Gemma and Llama models.
  • Figure 3: Zero-shot exact-match accuracy on the SDSC-EEA validation set ($n=20{,}016$) plotted against MMBench score. Colors and marker shapes denote model families. The black dashed line indicates the majority-class baseline (13.4%). Higher MMBench scores correlate with higher tool detection accuracy, but even the best model (Qwen3-VL-235B, MMBench 90.6) achieves only 14.52%---far below fine-tuned models (51.08%, Section \ref{subsec:lora_classification_head}).
  • Figure 4: Training dynamics for LoRA fine-tuning with JSON output on SDSC-EEA ($r=1024$). Left: Training loss (log scale) decreases steadily, confirming the model learns the structured output format. Center: Exact match accuracy. Right: Jaccard similarity. Both accuracy and Jaccard show a persistent gap between training and validation performance, indicating limited generalization to held-out procedures. Metrics are computed on fixed random subsets of 100 frames from each set, evaluated 100 times throughout training.
  • Figure 5: Training dynamics for LoRA fine-tuning with classification head on SDSC-EEA ($r=1024$). Left: Training loss (log scale). Center: Exact match accuracy. Right: Jaccard similarity. The classification head achieves the highest validation accuracy among all VLM-based methods (51.08%), outperforming JSON generation at the same LoRA rank (47.63%, Figure \ref{fig:s19_loss}). The persistent train--validation gap reflects limited generalization to held-out procedures. Metrics are computed on fixed random subsets of 100 frames from each set, approximately 100 times throughout training. (Illustrative sketches of these metrics and of a classification head follow the figure list.)
  • ...and 4 more figures
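
The captions for Figures 4 and 5 report two multi-label metrics over per-frame sets of tool labels: exact-match accuracy and Jaccard similarity. The sketch below is a minimal illustration of how such metrics can be computed; the set-based representation and the convention that a correctly predicted empty frame scores 1.0 are assumptions for illustration, not the paper's stated definitions.

```python
# Minimal sketch (illustrative, not the paper's code) of exact-match accuracy and
# Jaccard similarity for multi-label tool detection, where each frame is labeled
# with a set of tool names.
from typing import List, Set


def exact_match_accuracy(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Fraction of frames whose predicted tool set equals the ground-truth set exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_jaccard(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean intersection-over-union between predicted and ground-truth tool sets.
    Assumption: a frame with no tools, correctly predicted as empty, scores 1.0."""
    scores = []
    for t, p in zip(y_true, y_pred):
        scores.append(1.0 if not t and not p else len(t & p) / len(t | p))
    return sum(scores) / len(scores)


# Example drawn from Figure 1's bottom row: y = {Drill, Suction}, while the
# prediction contains Suction plus several spurious tools.
y_true = [{"Drill", "Suction"}]
y_pred = [{"Curette", "Grasper", "Irrigation", "Monopolar Electrocautery", "Suction"}]
print(exact_match_accuracy(y_true, y_pred))  # 0.0 (sets differ)
print(mean_jaccard(y_true, y_pred))          # ~0.167 (1 shared tool out of 6 total)
```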
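
Figure 5 contrasts JSON generation with a classification head on top of the fine-tuned backbone. As a rough illustration of the latter setup, the sketch below attaches a generic multi-label head trained with a binary cross-entropy loss; the hidden size, tool-vocabulary size, and use of a pooled feature vector are assumptions for illustration, not the paper's actual configuration.

```python
# Minimal sketch, assuming a PyTorch setup: a multi-label classification head of the
# kind Figure 5 describes, mapping a pooled backbone feature to per-tool logits.
# HIDDEN_SIZE and NUM_TOOLS are illustrative assumptions.
import torch
import torch.nn as nn

HIDDEN_SIZE = 4096  # assumed backbone hidden dimension
NUM_TOOLS = 12      # assumed size of the tool vocabulary


class ToolClassificationHead(nn.Module):
    """Linear head producing one logit per tool; multi-label via independent sigmoids."""

    def __init__(self, hidden_size: int = HIDDEN_SIZE, num_tools: int = NUM_TOOLS):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_tools)

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        return self.fc(pooled_features)


head = ToolClassificationHead()
criterion = nn.BCEWithLogitsLoss()       # each tool's presence is an independent label
features = torch.randn(8, HIDDEN_SIZE)   # stand-in for pooled VLM features (batch of 8)
targets = torch.zeros(8, NUM_TOOLS)      # multi-hot ground-truth tool sets
targets[:, 0] = 1.0                      # e.g., "Suction" present in every frame
loss = criterion(head(features), targets)
loss.backward()
```

A predicted tool set can then be recovered by thresholding the sigmoid outputs (a 0.5 threshold is one common choice), after which set-based metrics such as those sketched above apply directly.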