Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang

TL;DR

MedOrch tackles multimodal medical decision-making by coordinating open-source vision-language models through an LLM mediator to simulate reflective medical reasoning. The framework uses three roles—expert VLMs, mediator, and judge—to iteratively refine outputs via Socratic questioning and final judgment without additional training. Experiments on five VQA benchmarks show substantial performance gains over single models and competitive results against GPT-4V-based approaches, with robustness to underperforming agents. The results suggest that mediator-guided collaboration can unlock the strengths of heterogeneous open-source models for clinically relevant multimodal intelligence.

Abstract

Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent research primarily focuses on language-only tasks, and its extension to multimodal scenarios remains challenging. A blind combination of diverse vision-language models (VLMs) can amplify an erroneous interpretation of outcomes. Compared to large language models (LLMs) of comparable size, VLMs are generally less capable of instruction following and, importantly, of self-reflection. This disparity largely constrains VLMs' ability to take part in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs in order to collaborate. We utilize multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that collaboration among distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical visual question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence.
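The three-role workflow summarized above (expert agents answer, a mediator poses follow-up questions, a judge decides) can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: every name below (`collaborate`, `toy_mediator`, `toy_judge`, the two toy experts) is hypothetical, the agents are plain Python callables standing in for model calls, and the majority-vote judge is a placeholder assumption for the LLM-based judge.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical stand-ins for model calls (assumptions, not the paper's API).
Expert = Callable[[str, str], str]                          # (question, probe) -> answer
Mediator = Callable[[str, Dict[str, str]], Dict[str, str]]  # -> {expert name: follow-up question}
Judge = Callable[[str, List[Dict[str, str]]], str]          # (question, transcript) -> final answer


def collaborate(question: str, experts: Dict[str, Expert],
                mediator: Mediator, judge: Judge, rounds: int = 1) -> str:
    # Stage 1: every expert answers independently, with no probe.
    answers = {name: ask(question, "") for name, ask in experts.items()}
    transcript = [dict(answers)]
    for _ in range(rounds):
        # Stage 2: the mediator reads all answers and questions the relevant experts.
        probes = mediator(question, answers)
        if not probes:  # nothing left to challenge
            break
        # Stage 3: each questioned expert reflects and may revise its answer.
        for name, probe in probes.items():
            answers[name] = experts[name](question, probe)
        transcript.append(dict(answers))
    # Stage 4: the judge reviews the whole exchange and returns one answer.
    return judge(question, transcript)


# Toy agents, used only to exercise the control flow.
def expert_a(question: str, probe: str) -> str:
    return "yes"


def expert_b(question: str, probe: str) -> str:
    # Changes its answer once it has been asked to reflect.
    return "yes" if probe else "no"


def toy_mediator(question: str, answers: Dict[str, str]) -> Dict[str, str]:
    if len(set(answers.values())) <= 1:
        return {}  # experts already agree
    return {name: f"Another expert disagrees with '{ans}'. What evidence supports it?"
            for name, ans in answers.items()}


def toy_judge(question: str, transcript: List[Dict[str, str]]) -> str:
    # Placeholder: majority vote over the last round of answers.
    return Counter(transcript[-1].values()).most_common(1)[0][0]


result = collaborate("Is a lesion visible?", {"a": expert_a, "b": expert_b},
                     toy_mediator, toy_judge)
print(result)
```

In this toy run the two experts initially disagree, the mediator questions both, and the second expert revises its answer before the judge decides. In a real system each callable would wrap a VLM or LLM call, and the judge would reason over the full transcript rather than count votes.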

Paper Structure

This paper contains 15 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The mediator-guided multi-agent collaboration framework for medical VQA. In the initial stage, multiple VLM expert agents independently generate preliminary answers to a given question. To promote deeper interaction, the mediator agent synthesizes the information and formulates Socratic questions for the expert agents. The relevant expert agents then reflect on these questions and generate refined responses. Finally, a judge agent analyzes the dialogues between the mediator and the expert agents and produces the final output.
  • Figure 2: Comparative results across different agent configurations. PathVQA and PMC-VQA are selected for these ablation studies because they are relatively more challenging and provide representative evaluations.
  • Figure 3: Comparison with GPT-4V and other GPT-4V-based multi-agent methods on the PathVQA dataset.
  • Figure 4: Evidence of collaborative synergy in MedOrch on the PathVQA dataset.