Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain

Zhongxi Qiu, Zhang Zhang, Yan Hu, Heng Li, Jiang Liu

TL;DR

This work investigates data selection for Reinforcement Learning with Verified Rewards (RLVR) in the medical domain by comparing four strategies for sampling training data from MedQA-USMLE, with Gemma-3-12b-it as the base model. Training uses Group Relative Policy Optimization (GRPO), and the training data is curated either by random sampling (the baseline) or by model-based filtering with Phi-4, Gemma-3-27b-it, or Gemma-3-12b-it. The resulting models are evaluated on MMLU, CMMLU, MMLU-Pro, and GSM8K. Models trained on filtered data generally outperform those trained on randomly sampled data. Self-filtered data (filtered with Gemma-3-12b-it itself) boosts medical-domain metrics at the cost of robustness, while filtering with a larger model from the same series improves robustness across benchmarks and languages. The study highlights the importance of data organization in domain-specific RLVR and suggests future work including tool integration, full-parameter training, and more sophisticated data-selection strategies. The results offer practical guidance for constructing RLVR datasets in specialized domains and point to a trade-off between domain specialization and cross-benchmark robustness.

Abstract

This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. The code is available in our repository (https://github.com/Qsingle/open-medical-r1).
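
As a rough illustration of the two ingredients named in the abstract, the sketch below pairs a verifiable reward for a multiple-choice question with the group-relative advantage that gives GRPO its name: each sampled completion's reward is normalized by the mean and standard deviation of its own sampling group. The function names, the `Answer: X` output convention, the population standard deviation, and the `eps` constant are assumptions made for this example; they are not taken from the paper or its repository.

```python
import re
import statistics


def verified_reward(completion: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the last 'Answer: X' letter matches gold.

    The 'Answer: X' convention is an assumption for this sketch, not the
    paper's prompt format.
    """
    letters = re.findall(r"Answer:\s*([A-E])\b", completion)
    if not letters:
        return 0.0  # no parsable final answer
    return 1.0 if letters[-1] == gold else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its own sampling group.

    Uses the population standard deviation; eps avoids division by zero
    when every completion in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one question whose gold answer is "C".
    group = [
        "The findings suggest ... Answer: C",
        "Most likely ... Answer: B",
        "Unclear presentation, no final choice given.",
        "Considering the labs ... Answer: C",
    ]
    rewards = [verified_reward(c, "C") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under this normalization, a group in which every completion earns the same reward (all correct or all wrong) produces zero advantages and therefore no gradient signal for that question, which is one plausible reason the difficulty of selected samples can matter for RLVR.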

Paper Structure

This paper contains 14 sections, 2 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: The prompt used to make a response for a sample.
  • Figure 2: Template of the prompt during the training.
  • Figure 3: Results on the medicine domain. (a) Results on medicine-related subsets of MMLU. (b) Results on medicine-related subsets of CMMLU. (c) Results on medicine-related subsets of MMLU-Pro. (d) Radar chart of the results.
  • Figure 4: Results across MMLU's four main categories.
  • Figure 5: Performance breakdown for the Humanities category from the MMLU benchmark.
  • ...and 10 more figures