SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh, In Cheul Choi, Jae-Ho Chung, Jong Woong Park, Juyoun Park

TL;DR

SurgMLLMBench presents a unified multimodal benchmark for surgical scene understanding by integrating six datasets, including the new MAVIS micro-surgical dataset, under a common taxonomy and providing dense pixel-level instrument segmentation alongside workflow annotations (phase, step). The study demonstrates that instruction-tuning a single model on SurgMLLMBench yields robust cross-domain performance and generalizes to unseen data, enabling interactive VQA grounded in pixel-level evidence. A reproducible integration pipeline and VQA templates facilitate consistent evaluation and future development of interactive surgical reasoning models. Overall, the work advances intraoperative AI by enabling grounded multimodal reasoning, cross-domain generalization, and interpretable visual explanations for surgical education, assistance, and robotics.

Abstract

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

Paper Structure

This paper contains 12 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of SurgMLLMBench. Multi-domain surgical datasets, including the newly collected MAVIS, are unified into a multimodal benchmark through taxonomy alignment, label unification, and VQA annotation. The template-based question generator (orange) produces structured VQA pairs using five query types. SurgMLLMBench supports interactive multimodal surgical scene understanding.
  • Figure 2: Overview of the MAVIS dataset collection process.
  • Figure 3: MAVIS dataset distributions.
  • Figure 4: Comparison of dataset composition across tasks.
  • Figure 5: Qualitative visualization results of (a) instrument segmentation and (b, c) workflow recognition via VQA (green: correct, red: incorrect). OMG-LLaVA and LLaVA denote models trained individually on each dataset, whereas OMG-LLaVA§ and LLaVA§ represent a single model trained on SurgMLLMBench without additional per-dataset fine-tuning.