GP-VLS: A general-purpose vision language model for surgery

Samuel Schmidgall, Joseph Cho, Cyril Zakka, William Hiesinger

TL;DR

GP-VLS introduces an open-source, general-purpose vision-language framework tailored for surgery. It unites broad medical and surgical knowledge with visual scene understanding to support natural-language interaction in clinical settings. It builds on visual instruction tuning and broader VLM research, and pairs six new training datasets with a new benchmark, SurgiQual, that evaluates medical, surgical, and vision-language capabilities. Empirically, GP-VLS outperforms open- and closed-source baselines on the surgical vision-language tasks in SurgiQual, with accuracy improvements of 8-21%, and shows strong medical and surgical knowledge relative to open-source alternatives. The work provides a foundation for AI surgical assistants with scalable data, open resources, and a clear evaluation pathway, while acknowledging limitations such as coverage gaps for rare procedures and the need for clinical validation.

Abstract

Surgery requires comprehensive medical knowledge, visual assessment skills, and procedural expertise. While recent surgical AI models have focused on solving task-specific problems, there is a need for general-purpose systems that can understand surgical scenes and interact through natural language. This paper introduces GP-VLS, a general-purpose vision language model for surgery that integrates medical and surgical knowledge with visual scene understanding. To comprehensively evaluate general-purpose surgical models, we propose SurgiQual, which evaluates across medical and surgical knowledge benchmarks as well as surgical vision-language questions. To train GP-VLS, we develop six new datasets spanning medical knowledge, surgical textbooks, and vision-language pairs for tasks like phase recognition and tool identification. We show that GP-VLS significantly outperforms existing open- and closed-source models on surgical vision-language tasks, with 8-21% improvements in accuracy across SurgiQual benchmarks. GP-VLS also demonstrates strong performance on medical and surgical knowledge tests compared to open-source alternatives. Overall, GP-VLS provides an open-source foundation for developing AI assistants to support surgeons across a wide range of tasks and scenarios. The code and data for this work are publicly available at gpvls-surgery-vlm.github.io.

Paper Structure (35 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Visual depiction of the General Purpose Vision Language Assistant for Surgery (GP-VLS) and the content used to train it. GP-VLS is trained on, and can solve, both language-only and vision-language problems.
  • Figure 2: Example questions from each of the six categories of SurgiQual.