A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models

Jie Liu; Wenxuan Wang; Yihang Su; Jingyuan Huan; Wenting Chen; Yudi Zhang; Cheng-Yi Li; Kao-Jung Chang; Xiaohan Xin; Linlin Shen; Michael R. Lyu

TL;DR

Asclepius introduces a spectrum Med-MLLM benchmark spanning 15 specialties and 8 capacities with 3,232 original multi-modal QA items and a server-based evaluation platform to compare 6 Med-MLLMs against human doctors. The study reveals that generalist Med-MLLMs like GPT-4V outperform specialized models but still lag behind human clinicians, with multi-modal fusion and long-range instruction capture remaining key bottlenecks. By systematically analyzing specialty and capacity performance, the benchmark provides a rigorous, scalable framework to drive safe deployment and targeted improvements in medical AI. The work highlights both the breadth advantages of Med-MLLMs and their persistent gaps in diagnostic precision, modality fusion, and end-to-end clinical reasoning.

Abstract

Recent breakthroughs in Medical Multi-Modal Large Language Models (Med-MLLMs) are transforming modern healthcare through robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are ill-suited to Med-MLLMs given the complexity of real-world diagnostics across diverse specialties. To address this gap, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs along two dimensions: distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying clinical tasks into 3 main categories and 8 sub-categories, and avoiding overlap with existing VQA datasets. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs' capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.
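
The abstract describes evaluation along two axes, specialty and capacity. As a rough illustration only, the sketch below groups per-question correctness along each axis and computes the spread of accuracy across specialties. The record fields (`specialty`, `capacity`, `answer`, `prediction`), the helper `accuracy_by`, and the toy data are all hypothetical; this is not the Asclepius data format or its evaluation platform.

```python
from collections import defaultdict
from statistics import pvariance


def accuracy_by(records, key):
    """Return {group: accuracy}, grouping records by the given field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy records; field names and values are assumptions.
records = [
    {"specialty": "cardiovascular", "capacity": "perception",
     "answer": "A", "prediction": "A"},
    {"specialty": "cardiovascular", "capacity": "disease analysis",
     "answer": "B", "prediction": "C"},
    {"specialty": "gastroenterology", "capacity": "perception",
     "answer": "C", "prediction": "C"},
    {"specialty": "gastroenterology", "capacity": "disease analysis",
     "answer": "D", "prediction": "D"},
]

by_specialty = accuracy_by(records, "specialty")
by_capacity = accuracy_by(records, "capacity")

# Population variance of accuracy across specialties: one simple way to
# summarize how uneven a model is from one specialty to another.
specialty_spread = pvariance(list(by_specialty.values()))

print(by_specialty)
print(by_capacity)
print(specialty_spread)
```

Reporting both groupings separately keeps a model that is strong in one specialty but weak in a particular capacity from being hidden behind a single aggregate score.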

Paper Structure (35 sections, 15 figures, 7 tables)

Introduction
Related Work
Medical Multi-Modal Large Language Models (Med-MLLMs)
Benchmark for Med-MLLMs
Asclepius Benchmark
Multi-Specialty Coverage
Multi-Dimensional Capacity
Data Collection
Experiments
Implementation
Visual and Textual Modality Study
Results across Specialties
Results across Capacities
Discussion
Conclusion
...and 20 more sections

Figures (15)

Figure 1: Asclepius, a spectrum evaluation benchmark for Med-MLLMs, analyzes models on the capacity dimension with 8 clinical tasks and the specialty dimension with 15 medical specialties.
Figure 2: Asclepius Overview. (a) Involves 15 specialties and 79 body parts and organs in total, representing the critical component of the healthcare system. (b) Shows examples for 8 distinct capacities, offering a multifaceted evaluation of Med-MLLMs.
Figure 3: Data Statistics for Specialty. Currently, Asclepius incorporates 15 specialties with 3,232 multi-modal questions.
Figure 4: Data Statistics for Capacities. Asclepius includes two layers of capacity dimensions, which encompass 8 sub-capacities.
Figure 5: The spectrum of Med-MLLMs in Specialties. Green circle size shows accuracy variance across specialties; larger circles indicate higher variance. Darker squares represent higher accuracy. Numeric details are given in the appendix table (label table:models_accuracy). The meta doctor is an ensemble of several doctors whose areas of expertise cover these 15 specialties.
...and 10 more figures
