MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Shujie Liu, Yan Lu, Huaxiu Yao

TL;DR

MMedAgent-RL tackles the generalization gap of single-agent medical large vision-language models (Med-LVLMs) by enabling dynamic, reinforcement-learning-driven collaboration among GP and specialist agents in a clinically inspired GP→Specialists→GP loop. It introduces a curriculum-based multi-agent RL (C-MARL) framework that first trains a triage GP, then gathers specialist opinions for the assigned department, and finally trains an attending physician, via GRPO, to balance imitating the specialists against correcting their mistakes. Across five medical VQA benchmarks, it achieves state-of-the-art performance, with an average 20.7% gain over supervised fine-tuning baselines, and exhibits human-like, stepwise reasoning patterns. The work reports strong in-domain and out-of-domain generalization and points to a scalable path for robust multimodal medical reasoning.

Abstract

Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments of multiple specialists with its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance imitating specialists against correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 20.7% over supervised fine-tuning baselines.
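
The abstract says both GP agents are trained via RL, and the TL;DR names GRPO as the optimizer. As a rough illustration of the core idea behind GRPO (rewards for several sampled responses to the same prompt are normalized against their own group rather than against a learned value function), a minimal sketch follows. The function name, the `eps` constant, the binary correctness reward, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when all
    responses in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one VQA question,
# scored 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage; when every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.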

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of Med-Agent paradigms: single-agent $\rightarrow$ static workflows $\rightarrow$ dynamic collaboration. (a) Motivation: Single-agent models struggle with domain specialization, and prior multi-agent systems rely on fixed workflows, limiting adaptability. We propose a trainable reasoning-enhanced multi-agent system via RL. (b) Performance: Our method is highly competitive across multiple benchmarks.
  • Figure 2: Overview of MMedAgent-RL, an RL-driven multi-agent framework designed to enhance multimodal medical reasoning. It simulates the clinical loop of General Practitioner (GP) $\rightarrow$ Specialists $\rightarrow$ GP. MMedAgent-RL uses GRPO [guo2025deepseek] to optimize the triage doctor (the first GP) in order to improve triage accuracy. Then, powerful proprietary LVLMs are used as the specialist doctors for the assigned department. Finally, curriculum learning [bengio2009curriculum, pentina2015curriculum] and RL are combined to progressively train the attending physician (the second GP), who integrates the diverse opinions of specialists and makes robust decisions under varying levels of expert reliability.
  • Figure 3: (a) The average performance of both general practitioners (GP) and specialists is below 70%, which suggests a misalignment issue in multi-agent collaboration. Over-reliance on specialists' opinions or unilateral decisions by GPs can both lead to suboptimal outcomes. (b) Since specialists perform inconsistently across different cases, this poses a challenge for GPs when making decisions. Using all data for reinforcement fine-tuning can easily trap the model in a locally suboptimal solution (left). In contrast, our C-MARL approach enables the model to progressively accomplish sub-goals in a three-stage process and ultimately reach a globally optimal solution (right).
  • Figure 4: Results of different settings of specialist doctors.
  • Figure 5: Results under different levels of decision difficulty.
  • ...and 8 more figures
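
The Figure 3 caption describes a three-stage curriculum motivated by specialists performing inconsistently across cases. One plausible way to order training data for such a curriculum is to bucket each sample by how many specialists answered it correctly. The sketch below is only an illustration under that assumption: the stage names, the bucketing rule, and the sample fields (`specialist_answers`, `gold`) are hypothetical and are not stated in this summary.

```python
def difficulty_stage(specialist_answers, gold):
    """Assign a sample to a curriculum stage (assumed rule, for illustration).

    easy   -> every specialist matches the gold answer (imitation suffices)
    hard   -> no specialist matches (the attending physician must correct them)
    medium -> specialists disagree with each other
    """
    if not specialist_answers:
        raise ValueError("at least one specialist answer is required")
    correct = sum(answer == gold for answer in specialist_answers)
    if correct == len(specialist_answers):
        return "easy"
    if correct == 0:
        return "hard"
    return "medium"


def build_curriculum(samples):
    """Return samples grouped into three ordered stages: easy, medium, hard."""
    stages = {"easy": [], "medium": [], "hard": []}
    for sample in samples:
        stage = difficulty_stage(sample["specialist_answers"], sample["gold"])
        stages[stage].append(sample)
    return [stages["easy"], stages["medium"], stages["hard"]]


# Hypothetical multiple-choice samples with three specialist opinions each.
samples = [
    {"id": 1, "specialist_answers": ["A", "A", "A"], "gold": "A"},
    {"id": 2, "specialist_answers": ["B", "C", "B"], "gold": "C"},
    {"id": 3, "specialist_answers": ["D", "D", "B"], "gold": "A"},
]
curriculum = build_curriculum(samples)
for name, stage in zip(("easy", "medium", "hard"), curriculum):
    print(name, [s["id"] for s in stage])
```

Training would then proceed stage by stage, so that the attending physician first learns to follow reliable specialists before facing cases where they are partly or entirely wrong.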