EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery

Guankun Wang; Long Bai; Junyi Wang; Kun Yuan; Zhen Li; Tianxu Jiang; Xiting He; Jinlin Wu; Zhen Chen; Zhen Lei; Hongbin Liu; Jiazheng Wang; Fan Zhang; Nicolas Padoy; Nassir Navab; Hongliang Ren

TL;DR

EndoChat addresses the need for grounded multimodal understanding in robotic-assisted endoscopic surgery by introducing Surg-396K, a large-scale surgical image-instruction dataset, and the Mixed Visual Token Engine with a visual-contrast mechanism to reduce hallucinations. Built on SPHINX/LLaMA-2 with LoRA, EndoChat achieves state-of-the-art performance across multiple dialogue paradigms and surgical sub-tasks, and receives positive evaluations from expert endoscopists. The work demonstrates a scalable framework for surgeon-system interaction, enabling open-ended, context-aware guidance and training in complex endoscopic scenarios. It also discusses deployment and ethical considerations, outlining future directions for broader clinical validation and real-world integration.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated strong potential in computer-aided diagnosis and decision-making. In robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, MLLMs specialized for surgical scene understanding in clinical applications are still lacking. In this work, we introduce EndoChat to address the various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations from collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that EndoChat has the potential to advance training and automation in robotic-assisted surgery.

Paper Structure (30 sections, 6 equations, 4 figures, 8 tables)

Introduction
Related work
Multimodal Large Language Models
Surgical Vision-Language Models
Methods
Surgical Multimodal Instruction Dataset: Surg-396K
Preliminary for Constituent Datasets
Attribute Retrieval
Diverse Conversation Generation
Surgical Sub-task Formulation
Data Cleaning
Comparison with Existing Surgical Scene Understanding Datasets
Visual Enhanced MLLM: EndoChat
Preliminary for EndoChat
Mixed Visual Token Engine
...and 15 more sections

Figures (4)

Figure 1: Overview of EndoChat. (a) EndoChat is an interactive multimodal chatbot designed for surgical education and training. Users interact with EndoChat by uploading images and formulating questions, enabling comprehensive surgical scene understanding. (b) EndoChat is trained on Surg-396K, a large-scale multimodal instruction dataset. Surg-396K includes five conversation paradigms, enabling EndoChat to carry out natural language and visual grounding conversations with trainees. The bottom panel shows an example of a multi-turn conversation.
Figure 2: Overview of the construction pipeline and distribution statistics for our Surg-396K dataset. The pipeline involves five key steps: annotation attribute analysis, information extraction, instruction-tuning data generation, diverse conversation generation, and data cleaning.
Figure 3: Overview of the proposed EndoChat. For each input image, a multi-scale downsampling strategy generates views at different scales together with sub-images. $224^2$ and $512^2$ denote concatenated features with shapes 5×224×224×3 and 5×512×512×3, respectively. These features are encoded by a mixed visual backbone and then processed by the Mixed Visual Token Engine. The resulting vision tokens are projected into the language space so that they can be fed to the Large Language Model. In addition to visual inputs, region coordinates can serve as auxiliary inputs, along with specific prompts that guide user-defined tasks. This enables the LLM to generate language responses about the related object regions.
Figure 4: Endoscopist evaluation of EndoChat in 150 cases. (a) Questionnaire-based evaluation of EndoChat conducted by endoscopists. The pie charts illustrate the distribution of cases in which endoscopists express varying levels of agreement. (b) Correlation analysis of the four evaluation standards.
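
The shapes quoted in the Figure 3 caption (5×224×224×3 and 5×512×512×3) are consistent with one downsampled global view plus four sub-images per scale. The sketch below illustrates that reading only. The 2×2 quadrant split, the nearest-neighbour resizing, and the helper names `resize_nearest` and `multi_scale_views` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x C image to size x size x C."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows][:, cols]


def multi_scale_views(img: np.ndarray, size: int) -> np.ndarray:
    """Return a (5, size, size, C) stack: one global view and four quadrant crops.

    Assumption: the image is first brought to a (2*size) x (2*size) working
    resolution, then split into a 2x2 grid of sub-images.
    """
    full = resize_nearest(img, 2 * size)
    global_view = resize_nearest(full, size)
    crops = [
        full[r:r + size, c:c + size]
        for r in (0, size)
        for c in (0, size)
    ]
    return np.stack([global_view, *crops])


if __name__ == "__main__":
    # Hypothetical endoscopic frame; real data would be loaded from disk.
    frame = np.random.default_rng(0).integers(
        0, 256, size=(1080, 1920, 3), dtype=np.uint8
    )
    print(multi_scale_views(frame, 224).shape)
    print(multi_scale_views(frame, 512).shape)
```

Under these assumptions, each scale yields a five-image stack whose shape matches the caption; a real pipeline would likely use an interpolating resize and normalise pixel values before encoding.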
