NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification

Ziyang Song, Zelin Zang, Xiaofan Ye, Boqiang Xu, Long Bai, Jinlin Wu, Hongliang Ren, Hongbin Liu, Jiebo Luo, Zhen Lei

TL;DR

NeuroABench is a dedicated multimodal benchmark for neurosurgical anatomy identification, built from 89 curated videos and 32 teaching manuals to assess 68 anatomical structures across 32 approaches. The framework includes a data-collection, annotation, and QA-generation pipeline that produces 1079 QA pairs for evaluation. Zero-shot experiments on more than 10 MLLMs, together with a test of four neurosurgical trainees on a subset of the data, reveal a substantial performance gap: the best model reaches 40.87% accuracy, while the trainees average 46.5% and the best trainee reaches 56%. These results underscore the need for anatomy-focused training and robust reasoning in AI for neurosurgery. The work establishes a standardized, clinically grounded benchmark to guide the development of AI systems capable of reliable intraoperative anatomical understanding.

Abstract

Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on more than 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The best-performing student achieves 56% accuracy, the lowest-scoring student achieves 28%, and the average score is 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Examples of NeuroABench. This benchmark is constructed using neurosurgical anatomical content derived from publicly available educational videos. Each video is paired with a question designed to query the identification of specific neuroanatomical structures. For every question, a set of candidate anatomical structures (e.g., skin, anterior cerebral artery, temporal lobe) is provided as multiple-choice options.
  • Figure 2: Pipeline illustration of NeuroABench. The data collection can be divided into three main steps: 1) We search hundreds of videos and teaching manuals from the Neurosurgical Atlas, then keep 89 high-quality videos and 32 teaching manuals after filtering. 2) We use Gemini-1.5-Pro to annotate the videos, guided by clinician-reviewed structured progress extracted from the teaching manuals. 3) The annotated images undergo additional validation and expert selection. From these images, we generate question-and-answer pairs for each landmark anatomy featured in the videos.
  • Figure 3: A case study on the influence of anatomical deformation. We compare the responses of Claude-3.5-Sonnet to two closely similar frames from the same surgical procedure to demonstrate the impact of anatomical deformation on the model's anatomical recognition.
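
The multiple-choice format described for Figure 1 and the accuracy figures quoted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the `QAItem` fields, the `accuracy` helper, the video IDs, and the question wording are all hypothetical, and exact (case-insensitive) string matching against the ground-truth option is an assumption about how a prediction is judged correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice item; all field names are hypothetical."""
    video_id: str
    question: str
    options: tuple[str, ...]  # candidate anatomical structures
    answer: str               # ground-truth structure, one of `options`


def accuracy(items, predictions):
    """Fraction of items whose predicted option equals the ground truth.

    Matching is case-insensitive and ignores surrounding whitespace
    (an assumption, not a rule taken from the paper).
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot score an empty item list")
    correct = sum(
        pred.strip().lower() == item.answer.strip().lower()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# Toy data: the option names come from the Figure 1 caption; everything
# else (IDs, question text, predictions) is made up for illustration.
options = ("skin", "anterior cerebral artery", "temporal lobe")
items = [
    QAItem("v001", "Which structure is shown?", options, "temporal lobe"),
    QAItem("v002", "Which structure is shown?", options, "skin"),
]
predictions = ["Temporal lobe", "anterior cerebral artery"]

print(f"accuracy = {accuracy(items, predictions):.2%}")  # accuracy = 50.00%
```

A real harness would also need to map free-form model output onto one of the candidate options before scoring; that step is omitted here because the source does not describe it.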