MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, Hae Won Park

TL;DR

MDAgents introduces an adaptive, task-aware multi-agent framework for medical decision-making that dynamically assigns LLMs to roles and teams (solo primary care physician, multi-disciplinary team, or integrated care team) based on the medical complexity of the query. By integrating moderator-driven complexity classification, expert recruitment, and structured analysis with final ensemble-based decision-making, the approach achieves the best accuracy on 7 of 10 real-world medical benchmarks, including multimodal tasks. Ablation studies show the benefits of complexity-aware routing, moderator reviews, and retrieval-augmented knowledge, while efficiency analyses reveal favorable trade-offs between API cost and performance. The work demonstrates practical potential for scalable, collaborative AI-assisted medical diagnosis, while acknowledging limitations and proposing future enhancements such as domain-specific foundation models and patient-centered extensions.

Abstract

Foundation models are becoming valuable tools in medicine. Yet despite their promise, the best way to leverage Large Language Models (LLMs) in complex medical tasks remains an open question. We introduce a novel multi-agent framework, named Medical Decision-making Agents (MDAgents), that helps address this gap by automatically assigning a collaboration structure to a team of LLMs. The assigned solo or group collaboration structure is tailored to the medical task at hand, emulating real-world medical decision-making processes adapted to tasks of varying complexities. We evaluate our framework and baseline methods using state-of-the-art LLMs across a suite of real-world medical knowledge and medical diagnosis benchmarks, including a comparison of LLMs' medical complexity classification against human physicians. MDAgents achieved the best performance in seven out of ten benchmarks on tasks requiring an understanding of medical knowledge and multi-modal reasoning, showing a significant improvement of up to 4.2% (p < 0.05) over the best performance of previous methods. Ablation studies reveal that MDAgents effectively determines medical complexity to optimize for efficiency and accuracy across diverse medical tasks. Notably, the combination of moderator review and external medical knowledge in group collaboration resulted in an average accuracy improvement of 11.8%. Our code can be found at https://github.com/mitmedialab/MDAgents.
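The adaptive routing described above can be sketched in a few lines: a moderator first classifies the medical query's complexity, then dispatches it to a solo agent (low), a multi-disciplinary team (moderate), or an integrated care team (high). This is a minimal illustrative sketch, not the paper's implementation: the keyword-based classifier, the agent names, and the majority-vote aggregation are stand-ins (the actual framework prompts an LLM moderator and uses richer decision-making steps).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    role: str
    answer: Callable[[str], str]  # stand-in for an LLM call

def classify_complexity(query: str) -> str:
    """Toy moderator: MDAgents prompts an LLM for this classification."""
    n_findings = sum(kw in query.lower()
                     for kw in ("history", "imaging", "labs", "comorbid"))
    if n_findings >= 3:
        return "high"
    return "moderate" if n_findings >= 1 else "low"

def majority_vote(answers: List[str]) -> str:
    # Simplified stand-in for the framework's final decision-making step.
    return max(set(answers), key=answers.count)

def mdagents_route(query: str, solo: Agent,
                   mdt: List[Agent], ict: List[Agent]) -> str:
    """Dispatch to a solo or group collaboration structure by complexity."""
    level = classify_complexity(query)
    if level == "low":
        return solo.answer(query)          # PCP-style solo agent
    team = mdt if level == "moderate" else ict
    return majority_vote([a.answer(query) for a in team])
```

For example, a query mentioning history, imaging, and labs would be routed to the larger integrated care team, while a simple factual question stays with the solo agent.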

Paper Structure

This paper contains 61 sections, 1 equation, 22 figures, 13 tables.

Figures (22)

  • Figure 1: Medical Decision-Making Agents (MDAgents) framework. Given a medical query from different medical datasets, the framework performs 1) medical complexity check, 2) recruitment, 3) analysis and synthesis, and 4) decision-making steps.
  • Figure 2: Illustrative example of MDAgents in a moderate complexity case from the PMC-VQA dataset. More detailed case studies can be found in the case-study figures in the Appendix.
  • Figure 3: Experiment with the MedQA dataset (N=25 randomly sampled questions). (a) LLM's capability to classify complexity. (b-d) Evaluating 25 medical problems by solving each one 10 times at various complexity levels. The x-axis represents the accuracy achieved for each problem, while the y-axis shows the number of problems that reached that level of accuracy.
  • Figure 4: Our method outperforms Solo and Group settings across different medical benchmarks.
  • Figure 5: Impact of complexity selection of the query. Accuracy of each ablation on text-only (left), text+image (center) and text+video (right) benchmarks are reported.
  • ...and 17 more figures