VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons

Zhen Chen, Xingjian Luo, Jinlin Wu, Danny T. M. Chan, Zhen Lei, Jinqiao Wang, Sebastien Ourselin, Hongbin Liu

TL;DR

VS-Assistant addresses the fragmentation of surgical AI by unifying multimodal understanding and intent-driven function calling. It combines a surgically fine-tuned LLM with a mixture of projectors (MOP), which aligns visual inputs to the surgical semantic space, and a chain-of-thought (CoT) guided function-calling framework that invokes specialized surgical algorithms on demand. On neurosurgery data, it is reported to achieve high function-calling success and strong instrument detection and segmentation performance, outperforming several state-of-the-art multimodal models. The authors present it as a practical, adaptable AI assistant for the operating room and state that source code and models will be made public.

Abstract

Surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms to provide understanding and decision-making assistance for surgeons. Despite great progress, these algorithms are each developed for a single specific task and scenario, and in practice require different functions to be combined manually, which limits their applicability. An intelligent and versatile surgical assistant is therefore expected to accurately understand the surgeon's intentions and carry out the corresponding tasks to support the surgical process. In this work, by leveraging advanced multimodal large language models (MLLMs), we propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention and complete a series of surgical understanding tasks on demand, e.g., surgical scene analysis, surgical instrument detection, and segmentation. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture of projectors (MOP) module to align the surgical MLLM in VS-Assistant so that it balances natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy that enables the VS-Assistant to understand surgical intentions and thus make a series of surgical function calls on demand to meet the needs of surgeons. Extensive experiments on neurosurgery data confirm that our VS-Assistant understands the surgeon's intention more accurately than existing MLLMs, resulting in markedly superior performance in textual analysis and visual tasks. Source code and models will be made public.
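
To make the mixture-of-projectors idea more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a common mixture-of-experts formulation: several linear projectors map a visual token into the language model's embedding space, and a router produces softmax weights that blend their outputs. The class name `MixtureOfProjectors`, the dimensions, and the random initialization are all hypothetical; the abstract does not specify how the MOP is built or gated.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would load learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MixtureOfProjectors:
    """Hypothetical MOP: a soft-gated blend of linear projectors (assumption)."""

    def __init__(self, in_dim, out_dim, num_projectors, seed=0):
        rng = random.Random(seed)
        # One projection matrix per projector (e.g. natural vs. surgical).
        self.projectors = [
            random_matrix(out_dim, in_dim, rng) for _ in range(num_projectors)
        ]
        # The router scores each projector for a given visual token.
        self.router = random_matrix(num_projectors, in_dim, rng)

    def gate(self, visual_token):
        """Return one mixing weight per projector; the weights sum to 1."""
        return softmax(matvec(self.router, visual_token))

    def project(self, visual_token):
        """Blend all projector outputs using the gate weights."""
        weights = self.gate(visual_token)
        outputs = [matvec(p, visual_token) for p in self.projectors]
        out_dim = len(outputs[0])
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(out_dim)
        ]


# Usage: project a toy 4-dimensional visual token into a 3-dimensional space.
mop = MixtureOfProjectors(in_dim=4, out_dim=3, num_projectors=2)
token = [0.5, -1.0, 0.25, 2.0]
weights = mop.gate(token)
embedding = mop.project(token)
```

In this sketch, soft gating lets every projector contribute to each token, which is one plausible way to balance natural-image and surgical knowledge; a sparse top-k router would be an equally reasonable alternative.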

Paper Structure (11 sections, 5 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: (a) The VS-Assistant framework consists of the visual encoder, the mixture of projectors (MOP), the surgical LLM, and the surgical function set. The VS-Assistant is capable of accomplishing diverse surgical tasks on the demand of surgeons. (b) The training pipeline of VS-Assistant, including surgical LLM tuning, MOP tuning, and function-calling tuning.
  • Figure 2: Components of CoT-based function-calling conversations.
  • Figure 3: The conversation records of the VS-Assistant, including (a) without calling surgical functions and (b) calling surgical functions of detection and segmentation.
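
The function-calling behavior described in Figures 2 and 3 can be illustrated with a small dispatcher. This is a sketch under stated assumptions, not the paper's interface: it assumes the model emits a JSON message containing a chain-of-thought field, a textual answer, and a list of function calls, and that each call is routed to a registered surgical function. The message schema and the names `FUNCTION_SET`, `detect_instruments`, `segment_instruments`, and `dispatch` are hypothetical, and the two functions are stubs standing in for real detection and segmentation models.

```python
import json


def detect_instruments(image_id):
    """Stub detector (hypothetical); a real system would run a model here."""
    return {"task": "detection", "image_id": image_id, "boxes": []}


def segment_instruments(image_id):
    """Stub segmenter (hypothetical); a real system would run a model here."""
    return {"task": "segmentation", "image_id": image_id, "masks": []}


# Registry of callable surgical functions, keyed by the name the model emits.
FUNCTION_SET = {
    "detect_instruments": detect_instruments,
    "segment_instruments": segment_instruments,
}


def dispatch(model_output):
    """Parse the model's JSON message and run any requested functions in order."""
    message = json.loads(model_output)
    results = []
    for call in message.get("calls", []):
        name = call["name"]
        if name not in FUNCTION_SET:
            raise ValueError(f"unknown surgical function: {name}")
        results.append(FUNCTION_SET[name](**call.get("arguments", {})))
    return {"answer": message.get("answer", ""), "results": results}


# Case (a): a question answered in text, with no function call.
text_only = json.dumps({
    "thought": "The surgeon asks for a scene description; no tool is needed.",
    "answer": "The scene shows an instrument near the tissue.",
    "calls": [],
})

# Case (b): a request that needs detection followed by segmentation.
with_calls = json.dumps({
    "thought": "The surgeon wants instruments located and outlined.",
    "answer": "Running detection and segmentation.",
    "calls": [
        {"name": "detect_instruments", "arguments": {"image_id": "frame_001"}},
        {"name": "segment_instruments", "arguments": {"image_id": "frame_001"}},
    ],
})

plain_reply = dispatch(text_only)
tool_reply = dispatch(with_calls)
```

Rejecting unknown function names before execution keeps a malformed model output from silently doing nothing, which matters when the caller expects a detection or segmentation result.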