MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

Weihai Zhi, Jiayan Guo, Shangyang Li

TL;DR

MedGR$^2$ tackles data scarcity in medical vision-language reasoning by introducing a self-improving loop that jointly learns a data generator and a reward model to synthesize high-quality multimodal training data. It employs a two-stage policy optimization (reward-filtered SFT followed by GRPO) to achieve strong cross-modality and cross-task generalization, outperforming larger foundation models with a compact 7B-parameter model. Empirical results on OmniMedVQA demonstrate state-of-the-art performance and data efficiency, validating the synergy between synthetic, reward-informed data and reinforcement learning. This work proposes a paradigm shift from data curation to automated data generation for RL in high-stakes clinical reasoning, enabling scalable, generalizable medical AI.

Abstract

The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as a superior training source for both SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.

Paper Structure

This paper contains 29 sections, 3 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the MedGR$^2$ framework. Our self-improving framework operates in three sequential stages. Stage 1: VQA Generation. A VLM generator, guided by one of three prompt engineering strategies (Direct, Step-by-Step, Meta-cognitive), synthesizes multimodal VQA triplets $(I, q, a)$ from various medical images. Our findings indicate the Meta-cognitive prompt yields the highest quality data. Stage 2: Reward Model for Alignment. A dynamic Reward Model, visualized as a "Clinical Evaluation Prism," assesses each generated triplet based on four key criteria (Factual Accuracy, Formatted Accuracy, Reasoning Soundness, Instruction Relevance). This model continually adapts using new data, providing a robust quality signal. Stage 3: Two-Stage Policy Optimization. The final reasoning policy is trained in a two-stage process. First, it receives a Warm Start via Reward-Filtered SFT using high-quality data from Stage 2. Second, it is further optimized to achieve superior Generalization through RL (GRPO), creating a highly capable and robust medical reasoning model.
  • Figure 2: The synergy of reward-filtered data and Reinforcement Learning across varying data scales. We compare four training strategies: 'RandK' (SFT on K randomly sampled generated examples); 'TopK' (SFT on the top K samples selected by our reward model); 'Rand11000' (a baseline SFT on 11,000 unfiltered samples); and 'TopK+Rand' (MedGR$^2$ with GRPO applied on the 'TopK' data).
  • Figure 3: Distribution of modalities in the training data.
  • Figure 5: Head-to-Head Error Transition Analysis. Each bar dissects the entire question set for a modality into four transition flows based on the outcomes of the two models. The four flows are: Correct$\rightarrow$Correct (green), where both models succeed; Wrong$\rightarrow$Correct (blue), where MedGR$^2$ corrects the baseline's error; Correct$\rightarrow$Wrong (orange), where our model makes a new error on a question the baseline answered correctly; and Wrong$\rightarrow$Wrong (red), where both models fail.
  • Figure : (a) Cross-task transfer performance.
  • ...and 2 more figures
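
The reward-filtered selection and the GRPO step named in the Figure 1 and Figure 2 captions can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `ScoredSample` class, the `[0, 1]` score scale, and the unweighted mean over the four criteria are assumptions made for illustration; the page does not say how the reward model combines its criteria. The group-relative advantage follows the standard GRPO formulation, in which each sampled response's reward is normalized by the mean and standard deviation of its group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# The four criteria named in the Figure 1 caption.
CRITERIA = (
    "factual_accuracy",
    "formatted_accuracy",
    "reasoning_soundness",
    "instruction_relevance",
)


@dataclass
class ScoredSample:
    """Hypothetical container for one generated VQA triplet (I, q, a) plus scores."""
    image_id: str
    question: str
    answer: str
    scores: dict  # criterion name -> score, assumed to lie in [0, 1]


def aggregate_reward(sample):
    # Assumption: an unweighted mean over the four criteria.
    return mean(sample.scores[c] for c in CRITERIA)


def top_k(samples, k):
    # 'TopK' strategy: keep the k samples with the highest aggregate reward.
    return sorted(samples, key=aggregate_reward, reverse=True)[:k]


def group_relative_advantages(rewards, eps=1e-8):
    # Standard GRPO normalization: A_i = (r_i - mean(r)) / (std(r) + eps),
    # computed over one group of responses sampled for the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def _uniform(value):
    # Helper for the demo: same score on every criterion.
    return {c: value for c in CRITERIA}


pool = [
    ScoredSample("img_a", "q1", "a1", _uniform(0.2)),
    ScoredSample("img_b", "q2", "a2", _uniform(0.9)),
    ScoredSample("img_c", "q3", "a3", _uniform(0.6)),
]
selected = top_k(pool, k=2)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Here `selected` holds the two highest-scoring samples, which would form the warm-start SFT set, and `advantages` gives positive values to responses rewarded above their group mean and negative values to those below it. When every response in a group receives the same reward, all advantages are zero, so that group contributes no policy-gradient signal.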