Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci

TL;DR

This work addresses the challenge of medically grounded reasoning in LLMs without relying on supervised fine-tuning with distilled chain-of-thought data. It introduces AlphaMed, a medical LLM trained solely through minimalist rule-based reinforcement learning on public MCQ datasets, using Group Relative Policy Optimization and binary is_correct rewards. The results show state-of-the-art performance across six medical QA benchmarks, including superior performance on hard/advanced tasks compared to larger or closed-source models, highlighting emergent reasoning without CoT supervision. A data-centric analysis reveals that dataset informativeness and a mix of difficulty levels are key to inducing robust reasoning, while current benchmarks may inadequately capture true reasoning progress. The work suggests practical paths for scalable, interpretable medical LLMs and motivates the development of more challenging reasoning-oriented benchmarks.

Abstract

Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
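The training signal described above, a binary rule-based reward combined with Group Relative Policy Optimization, can be illustrated with a minimal sketch. It is not the authors' implementation. The `Answer: X` output format, the helper names `binary_reward` and `group_relative_advantages`, and the use of the population standard deviation are all assumptions made for illustration. The sketch covers only the reward and the group-normalized advantage, not the policy update itself.

```python
import re
from statistics import mean, pstdev

# Assumption: the model is prompted to end its response with "Answer: <letter>".
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-J])\)?\s*$")


def binary_reward(response: str, gold: str) -> float:
    """Rule-based reward: 1.0 if the final option letter equals the gold label, else 0.0."""
    match = ANSWER_RE.search(response.strip())
    return 1.0 if match and match.group(1) == gold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question, a group of four sampled responses, gold answer "B".
responses = [
    "The findings point to option B. Answer: B",
    "I think Answer: C",
    "no final answer",
    "Step-by-step reasoning...\nAnswer: (B)",
]
rewards = [binary_reward(r, "B") for r in responses]
advantages = group_relative_advantages(rewards)
```

In this sketch, a group whose responses are all correct or all incorrect yields zero advantage for every response, so such a question contributes no gradient signal under group normalization.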

Paper Structure

This paper contains 29 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Performance comparison on six medical QA benchmarks. Our models are initialized with Llama3.1-8B-Instruct [llama3] and trained using minimalist rule-based RL on one of three balanced subsets: MedQA-Sub, MedMCQA-Sub, or PubMedQA-Sub (shown as blue, green, and orange bars, respectively). Despite using only 1,200 examples per subset, all variants of our model achieve substantial improvements over the base Llama3.1-8B-Instruct and match or surpass the strong baseline HuatuoGPT-o1-8B across all benchmarks.
  • Figure 2: Dataset analysis and training dynamics. Left: Ratio of effective queries over training steps; each curve corresponds to models trained on a specific subset. Middle: Training reward per step for models trained on each subset. Right: Distribution of question lengths (number of tokens) in MedQA, MedMCQA, and PubMedQA [medqa, medmcqa, jin2019pubmedqa].
  • Figure 3: Effect of data quantity. Average accuracy across six medical QA benchmarks as the number of samples per level increases from 200 to 400, so that the total subset size grows from 1,200 to 2,400 examples. Scaling MedQA-Sub and MedMCQA-Sub leads to consistent performance gains, highlighting the value of informative data. In contrast, PubMedQA-Sub shows no improvement, reflecting the limitations of less informative data sources.
  • Figure 4: Effect of data diversity. Average accuracy across six medical QA benchmarks when models are trained individually on single or combined subsets. Adding MedMCQA-Sub to MedQA-Sub boosts performance, while further adding PubMedQA-Sub reduces it, suggesting that less informative data can negate the benefits of increased diversity.
  • Figure 5: Performance on six benchmarks when training on subsets with increasing difficulty levels (L1 to L6). Each blue dot represents a separately trained model on a subset that includes all data up to the indicated difficulty level; new data are incorporated only through separate training runs, not incrementally during training. While performance on MedXpert [zuo2025medxpertqa] increases consistently, trends on other benchmarks vary. Final models trained on the full set (L1–L6) generally achieve comparable or superior performance to HuatuoGPT-o1-8B [chen2024huatuogpt].
  • ...and 8 more figures
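
The subset sizes quoted in the captions for Figures 3 and 5 follow from sampling the same number of questions at each of six difficulty levels (6 × 200 = 1,200 and 6 × 400 = 2,400). The sketch below shows one way such balanced and cumulative subsets could be built. It assumes each question already carries a precomputed integer `level` from 1 to 6; how the levels are assigned is not covered in this excerpt, and the helper names `balanced_subset` and `cumulative_subsets` are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(items, per_level, levels=range(1, 7), seed=0):
    """Draw `per_level` questions from each difficulty level without replacement."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for item in items:
        by_level[item["level"]].append(item)
    subset = []
    for lvl in levels:
        pool = by_level[lvl]
        if len(pool) < per_level:
            raise ValueError(f"level {lvl} has only {len(pool)} items")
        subset.extend(rng.sample(pool, per_level))
    return subset


def cumulative_subsets(subset, max_level=6):
    """Map k -> all questions with level <= k, one training set per separate run."""
    return {
        k: [item for item in subset if item["level"] <= k]
        for k in range(1, max_level + 1)
    }


# Synthetic pool with 500 questions per level, for illustration only.
pool = [{"id": i, "level": 1 + i % 6} for i in range(3000)]
small = balanced_subset(pool, per_level=200)   # 1,200 questions
large = balanced_subset(pool, per_level=400)   # 2,400 questions
by_difficulty = cumulative_subsets(small)      # keys 1..6
```

Each entry of `by_difficulty` would be used for an independent training run, matching the caption's statement that data are not added incrementally within a single run.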