MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems

Yunhang Qian; Xiaobin Hu; Jiaquan Yu; Siyang Xin; Xiaokun Chen; Jiangning Zhang; Peng-Tao Jiang; Jiawei Liu; Hongwei Bran Li

TL;DR

This work presents MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. It provides a rigorous ablation of interaction mechanisms and cost-performance trade-offs and establishes a new technical baseline for future autonomous clinical systems.

Abstract

While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent evaluation of visual reasoning, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) a standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities; (2) an automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that uses large vision-language models to verify diagnostic logic and visual grounding, overcoming the limitations of lexical string matching; and (3) the most extensive benchmark to date, spanning 11 organ systems and 473 diseases and standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: https://github.com/NUS-Project/MedMASLab/
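To make contribution (2) concrete, here is a minimal sketch contrasting rule-based option extraction with judge-based semantic scoring. It is an illustration under stated assumptions, not MedMASLab's evaluator: the prompt wording, the CORRECT/INCORRECT verdict format, and all function names (`lexical_score`, `build_judge_prompt`, `semantic_score`, `toy_judge`) are hypothetical, and `toy_judge` is a crude keyword heuristic standing in for a real vision-language model so that the example runs offline.

```python
import re
from typing import Callable


def lexical_score(prediction: str, gold_letter: str) -> bool:
    """Rule-based scoring: treat the first standalone letter A-E as the chosen option."""
    match = re.search(r"\b([A-E])\b", prediction)
    return match is not None and match.group(1) == gold_letter


def build_judge_prompt(question: str, prediction: str, reference: str) -> str:
    """Hypothetical grading prompt asking a judge model for a one-word verdict."""
    return (
        "Decide whether the response reaches the same final answer as the reference, "
        "ignoring differences in wording.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Response: {prediction}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def semantic_score(question: str, prediction: str, reference: str,
                   judge: Callable[[str], str]) -> bool:
    """Semantic scoring: delegate the verdict to a judge (in practice, a large VLM)."""
    verdict = judge(build_judge_prompt(question, prediction, reference)).strip().upper()
    return verdict.startswith("CORRECT")  # "INCORRECT" does not start with "CORRECT"


def toy_judge(prompt: str) -> str:
    """Hypothetical offline stand-in for a VLM: accepts the response if the
    reference term appears after the last 'answer is' in the response."""
    reference = re.search(r"^Reference: (.*)$", prompt, re.M).group(1).lower()
    response = re.search(r"^Response: (.*)$", prompt, re.M).group(1).lower()
    return "CORRECT" if reference in response.rsplit("answer is", 1)[-1] else "INCORRECT"


question = ("Chest X-ray shows lobar consolidation. Most likely diagnosis? "
            "(A) Asthma (B) Pneumothorax (C) Pneumonia")
prediction = "A pneumothorax is excluded by the image; the answer is (C) Pneumonia."

print(lexical_score(prediction, "C"))                                # False: the article "A" is matched first
print(semantic_score(question, prediction, "Pneumonia", toy_judge))  # True
```

The point of the contrast is that the lexical rule fails on a correct answer because of surface form alone, whereas a judge that reads the whole response can credit it; in a real setup the `judge` callable would wrap a vision-language model that also sees the image.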

Paper Structure (16 sections, 18 figures, 4 tables)

Introduction
Related Work
MedMASLab: A Unified Orchestration Framework
Multimodal Agentic Orchestration
Rethinking MAS Evaluation: From Rules to Semantics
From Rule-Based Matching to Semantic Verification.
Experiments and Analysis
Experiment Setup
Comparison Experiments
Error Analysis
Conclusion
Dataset
Methods
Additional Results
Error examples

Figures (18)

Figure 1: MedMASLab, the first unified orchestration framework designed for medical visual-language multi-agent systems.
Figure 2: Framework of MedMASLab.
Figure 3: Performance and ranking variations of different MAS methods across five evaluation protocols on DxBench, PubMedQA, and MedXpertQA.
Figure 4: Trade-off between performance and token cost of Qwen2.5VL-7B-Instruct-based multi-agent methods across medical benchmarks.
Figure 5: Comparison of Method Performance and Token Cost using Qwen2.5VL-7B, LLaVA-7B, and GPT-4o-mini as Backbones on MedQA and MedVidQA.
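Figures 4 and 5 relate method performance to token cost. One common way to summarize such a trade-off is the Pareto front: the methods that no other method beats on accuracy and token cost at the same time. The following is a minimal sketch of that computation, assuming each method is reduced to one accuracy and one average token count per benchmark; it is not stated that the paper summarizes its results this way, and the method names and numbers in the usage lines are made-up placeholders, not results.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    accuracy: float  # fraction of questions answered correctly (higher is better)
    tokens: int      # average tokens consumed per question (lower is better)


def dominates(a: Run, b: Run) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens
    strictly_better = a.accuracy > b.accuracy or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep the runs that no other run dominates, ordered from cheapest to most expensive."""
    front = [r for r in runs if not any(dominates(o, r) for o in runs)]
    return sorted(front, key=lambda r: r.tokens)


# Made-up placeholder values for illustration only (not results from the paper).
runs = [
    Run("M1", 0.60, 800),
    Run("M2", 0.64, 4_000),
    Run("M3", 0.62, 6_000),
    Run("M4", 0.66, 12_000),
]
print([r.name for r in pareto_front(runs)])  # ['M1', 'M2', 'M4']; M3 is dominated by M2
```

The pairwise O(n²) check is adequate for the roughly dozen methods compared on a benchmark; a larger sweep could instead sort by token cost and keep each run whose accuracy exceeds every cheaper run's.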
