GP-VLS: A general-purpose vision language model for surgery

Samuel Schmidgall; Joseph Cho; Cyril Zakka; William Hiesinger

TL;DR

GP-VLS is an open-source, general-purpose vision-language model for surgery that combines broad medical and surgical knowledge with visual scene understanding to support natural-language interaction in clinical settings. It builds on visual instruction tuning and prior VLM research, and pairs six new training datasets with a new benchmark, SurgiQual, which evaluates medical knowledge, surgical knowledge, and surgical vision-language capabilities. On the SurgiQual vision-language tasks, GP-VLS outperforms open- and closed-source baselines by 8-21% in accuracy, and it performs strongly on medical and surgical knowledge tests relative to open-source alternatives. The work provides a foundation for AI surgical assistants with scalable data, open resources, and a clear evaluation pathway, while acknowledging limitations such as coverage gaps for rare procedures and the need for clinical validation.

Abstract

Surgery requires comprehensive medical knowledge, visual assessment skills, and procedural expertise. While recent surgical AI models have focused on solving task-specific problems, there is a need for general-purpose systems that can understand surgical scenes and interact through natural language. This paper introduces GP-VLS, a general-purpose vision language model for surgery that integrates medical and surgical knowledge with visual scene understanding. To comprehensively evaluate general-purpose surgical models, we propose SurgiQual, which evaluates across medical and surgical knowledge benchmarks as well as surgical vision-language questions. To train GP-VLS, we develop six new datasets spanning medical knowledge, surgical textbooks, and vision-language pairs for tasks like phase recognition and tool identification. We show that GP-VLS significantly outperforms existing open- and closed-source models on surgical vision-language tasks, with 8-21% improvements in accuracy across SurgiQual benchmarks. GP-VLS also demonstrates strong performance on medical and surgical knowledge tests compared to open-source alternatives. Overall, GP-VLS provides an open-source foundation for developing AI assistants to support surgeons across a wide range of tasks and scenarios. The code and data for this work are publicly available at gpvls-surgery-vlm.github.io.
