OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
Zhihao Peng, Cheng Wang, Shengyuan Liu, Zhiying Liang, Zanting Ye, Minjie Ju, PeterYM Woo, Yixuan Yuan
TL;DR
OmniBrainBench addresses the need for a comprehensive, clinically aligned multimodal brain-imaging benchmark by assembling 15 modalities and 15 multi-stage tasks across five clinical phases, validated by a professional radiologist. It introduces a three-stage construction pipeline (data collection, question augmentation, and filtering) to produce 9,527 clinically verified VQA pairs over 31,706 images, enabling rigorous evaluation of 24 MLLMs across closed- and open-ended VQA. Results show a persistent gap between current MLLMs and physician performance, with notable modality- and task-dependent variability, underscoring the need for domain-specific pretraining and robust reasoning capabilities. The benchmark provides an open, multi-faceted platform for assessing and guiding progress toward clinically reliable brain-imaging MLLMs, while acknowledging safety and validation needs before real-world deployment.
Abstract
Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly used to support it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis with closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, including open-source general-purpose, medical, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs such as GPT-5 (63.37%) outperform the other models yet lag far behind physicians (91.35%), while medical MLLMs show wide variance across closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all models fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard for assessing MLLMs in brain imaging analysis and highlights the remaining gap to physicians. We publicly release our benchmark at link.
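As a rough illustration of how a closed-ended VQA score and the model-to-physician gap quoted above might be computed, here is a minimal sketch. It assumes the closed-ended items are multiple-choice questions scored by exact match on the option letter and that the reported percentages are accuracy-like scores; the abstract does not specify the metric. The names `ClosedVQAItem`, `closed_ended_accuracy`, and `gap_to_physicians` are hypothetical and are not part of the released benchmark.

```python
from dataclasses import dataclass


@dataclass
class ClosedVQAItem:
    """One multiple-choice item (hypothetical schema, not the benchmark's)."""
    question: str
    answer: str      # gold option letter, e.g. "B"
    prediction: str  # option letter returned by the model


def closed_ended_accuracy(items: list[ClosedVQAItem]) -> float:
    """Percentage of items whose predicted letter matches the gold letter."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        1
        for item in items
        if item.prediction.strip().upper() == item.answer.strip().upper()
    )
    return 100.0 * correct / len(items)


def gap_to_physicians(model_score: float, physician_score: float) -> float:
    """Difference in percentage points between physician and model scores."""
    return round(physician_score - model_score, 2)


if __name__ == "__main__":
    # Toy items for illustration only; not taken from the benchmark.
    toy_items = [
        ClosedVQAItem("Which modality is shown?", "A", "a"),
        ClosedVQAItem("Which lobe contains the lesion?", "C", "B"),
    ]
    print(closed_ended_accuracy(toy_items))
    # Scores quoted in the abstract: GPT-5 63.37, physicians 91.35.
    print(gap_to_physicians(63.37, 91.35))
```

With the two figures from the abstract, the gap comes to about 28 percentage points. Open-ended answers would need a different scoring scheme, which this sketch does not attempt.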
