GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning

Yanzhou Su, Tianbin Li, Jiyao Liu, Chenglong Ma, Junzhi Ning, Cheng Tang, Sibo Ju, Jin Ye, Pengcheng Chen, Ming Hu, Shixiang Tang, Lihao Liu, Bin Fu, Wenqi Shao, Xiaowei Hu, Xiangwen Liao, Yuanfeng Ji, Junjun He

TL;DR

Problem: existing medical multimodal AI often relies on supervised fine-tuning (SFT), which emphasizes memorization over robust reasoning. Approach: the authors propose GMAI-VL-R1, trained with reinforcement learning tuning (RLT) using GRPO-based policy optimization, together with a high-quality reasoning dataset (GMAI-Reasoning10K) to strengthen chain-of-thought (CoT) reasoning and reflection. Contributions: (i) GMAI-VL-R1 with RLT, (ii) GMAI-Reasoning10K, comprising 10K CoT-annotated VQA pairs across 12 modalities, and (iii) extensive benchmarks showing improved generalization and efficiency over SFT baselines. Significance: the work demonstrates that RL-driven reasoning can yield more robust, data-efficient performance in medical decision-support tasks, with potential for broader clinical deployment.

Abstract

Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model's generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at https://github.com/uni-medical/GMAI-VL-R1.

Paper Structure

This paper contains 32 sections, 4 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Quantitative comparison of the performance of different models across various benchmarks. The results show that, in most benchmarks, the RLT-based model outperforms the SFT-based model.
  • Figure 2: The framework of reinforcement learning tuning. Given the input medical image (chest X-ray) and question, the policy model generates multiple reasoning responses through group sampling. The reasoning steps are then evaluated based on accuracy, format, and repetition rewards. This process updates the policy model, guiding it towards more accurate diagnoses. The final answer, "Normal Chest X-ray," is generated based on the reasoning process that identifies pulmonary tuberculosis indicators.
  • Figure 3: Modality distribution of the curated GMAI-Reasoning10K dataset. GMAI-Reasoning10K provides 10K high-quality visual question answering pairs spanning 12 medical modalities.
  • Figure 4: Case study illustrating the model's reasoning ability under Reinforcement Learning Tuning (RLT). Given medical images, the model identifies the most accurate diagnosis based on visible symptoms. RLT encourages the model to engage in reasoning and select the correct answer from multiple choices.
  • Figure 5: Distribution of generated answer lengths (in word count) for the Baseline, +SFT, and +RLT models. Each histogram displays the total number of answers (light bars) and correct answers (dark bars).
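
The reward-and-update loop described in the Figure 2 caption (group sampling, then accuracy, format, and repetition rewards) can be illustrated with a minimal sketch. This is not the authors' implementation. The `<think>`/`<answer>` tag layout, the n-gram repetition penalty, the unweighted sum of the three rewards, and all function names are assumptions chosen for illustration. The only part taken from standard GRPO is the group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group.

```python
import re
from statistics import mean, pstdev


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the text inside <answer>...</answer> equals the gold answer (assumed tag format)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == gold.strip().lower() else 0.0


def format_reward(response: str) -> float:
    """1.0 if the response is exactly a <think> block followed by an <answer> block."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, re.DOTALL) else 0.0


def repetition_penalty(response: str, n: int = 3) -> float:
    """Non-positive penalty: minus the fraction of duplicated word n-grams (hypothetical choice)."""
    words = response.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return -(1.0 - len(set(ngrams)) / len(ngrams))


def total_reward(response: str, gold: str) -> float:
    """Unweighted sum of the three signals; real weights are not given in the source."""
    return (
        accuracy_reward(response, gold)
        + format_reward(response)
        + repetition_penalty(response)
    )


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # A toy group of sampled responses for one multiple-choice question.
    gold = "B"
    group = [
        "<think>No focal opacity is seen.</think><answer>B</answer>",
        "<think>Opacity present.</think><answer>A</answer>",
        "B",
    ]
    rewards = [total_reward(r, gold) for r in group]
    for response, adv in zip(group, group_advantages(rewards)):
        print(f"{adv:+.3f}  {response}")
```

Responses with above-average reward in the group receive a positive advantage and are reinforced, while below-average ones are pushed down. Because the baseline is the group mean, no separate value model is needed.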