Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning

Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, Hoifung Poon

TL;DR

This work extends reinforcement learning from verifiable rewards (RLVR) to the medical domain by applying it to MCQA data with a 3B base model. Using a PPO-based framework and a rule-based verifier reward, Med-RLVR achieves in-distribution parity with supervised fine-tuning and substantially better out-of-distribution generalization (about +8 points). The study also traces the training dynamics, revealing emergent reasoning within the base model without explicit supervision, and documents a multi-stage progression of reasoning patterns, including reward-hacking behaviors. While promising, the work notes limitations of MCQA, calls for multimodal and more complex medical tasks, and outlines directions for mitigating reward hacking and extending to real-world medical reasoning. Overall, Med-RLVR demonstrates the potential of RLVR to elicit domain-specific reasoning in knowledge-intensive fields beyond math and coding.

Abstract

Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning capabilities from base language models without explicit reasoning supervision, as demonstrated by DeepSeek-R1. While prior work on RLVR has primarily focused on mathematical and coding domains, its applicability to other tasks and domains remains unexplored. In this work, we investigate whether medical reasoning can emerge from RLVR. We introduce Med-RLVR as an initial study of RLVR in the medical domain, leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering. Notably, Med-RLVR achieves performance comparable to traditional supervised fine-tuning (SFT) on in-distribution tasks while significantly improving out-of-distribution generalization, with an 8-point accuracy gain. Further analysis of training dynamics reveals that, with no explicit reasoning supervision, reasoning emerges from the 3B-parameter base model. These findings underscore the potential of RLVR in domains beyond math and coding, opening new avenues for its application in knowledge-intensive fields such as medicine.
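
To make "MCQA data as verifiable labels" concrete, here is a minimal sketch of what a rule-based verifier reward for multiple-choice answers could look like. The `<think>`/`<answer>` tag format, the function name `mcqa_reward`, and the specific reward values (-1 for a malformed response, 1 for a correct choice, 0 otherwise) are illustrative assumptions, not details stated in the text above.

```python
import re

# Assumed (hypothetical) response format: free-form reasoning inside
# <think>...</think>, followed by one option letter inside <answer>...</answer>.
_RESPONSE_PATTERN = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>\s*([A-Z])\s*</answer>\s*",
    re.DOTALL,
)


def mcqa_reward(response: str, gold_choice: str) -> float:
    """Score one model response against the gold option letter.

    The reward values are assumptions chosen for illustration:
    -1.0 if the response does not follow the expected format,
     1.0 if the extracted letter equals the gold letter,
     0.0 if the format is valid but the letter is wrong.
    """
    match = _RESPONSE_PATTERN.fullmatch(response)
    if match is None:
        return -1.0
    predicted = match.group(2)
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0


if __name__ == "__main__":
    good = "<think>Fever and neck stiffness suggest meningitis.</think><answer>B</answer>"
    print(mcqa_reward(good, "B"))   # well-formed, correct letter
    print(mcqa_reward(good, "C"))   # well-formed, wrong letter
    print(mcqa_reward("B", "B"))    # missing tags, treated as malformed
```

Because the check depends only on string format and the gold label, no learned reward model is needed. The same simplicity is what leaves room for the reward-hacking behaviors mentioned in the TL;DR.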

Paper Structure

This paper contains 19 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An overview of Med-RLVR (see the method section for details).
  • Figure 2: The training dynamics of Med-RLVR (see the section on pattern shifts for details).
  • Figure 3: Comparing Med-RLVR and SFT on in-distribution and out-of-distribution tasks. Error bars show the standard deviation over 1,000 bootstrap resamples (Tibshirani and Efron, 1993).
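
The error bars in Figure 3 come from bootstrapping. As a rough sketch of how such a standard deviation can be computed for an accuracy score, the snippet below resamples per-question correctness flags with replacement. The function name `bootstrap_accuracy_std`, the fixed seed, and the input encoding (1 for correct, 0 for incorrect) are assumptions for illustration; the authors' exact procedure is not described here.

```python
import random
import statistics
from typing import Sequence


def bootstrap_accuracy_std(
    correct: Sequence[int],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate the standard deviation of accuracy by bootstrapping.

    `correct` holds one flag per test question (1 = correct, 0 = incorrect).
    Each resample draws len(correct) flags with replacement and records
    the resulting accuracy; the spread of those accuracies is returned.
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        resample = rng.choices(correct, k=n)  # sampling with replacement
        accuracies.append(sum(resample) / n)
    return statistics.stdev(accuracies)


if __name__ == "__main__":
    # Hypothetical outcome: 70 of 100 questions answered correctly.
    flags = [1] * 70 + [0] * 30
    print(f"bootstrap std of accuracy: {bootstrap_accuracy_std(flags):.4f}")
```

The resulting value would be plotted as the half-length of an error bar around the observed accuracy.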