MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment

Kailong Fan, Anqi Pu, Yichen Wu, Wanhua Li, Yicong Li, Hanspeter Pfister, Huafeng Liu, Xiang Li, Quanzheng Li, Ning Guo

TL;DR

This work proposes a unified training paradigm that integrates medical process reward models (PRMs) with Test-Time Reinforcement Learning (TTRL) to bridge the gap between test-time scaling (TTS) and parametric model optimization. Its findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.

Abstract

Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel and unified training paradigm that integrates medical process reward models (PRMs) with TTRL to bridge the gap between test-time scaling (TTS) and parametric model optimization. Specifically, we advance the TTRL framework by replacing conventional MV with a fine-grained, expert-aligned supervision paradigm using Med-RPM. This integration ensures that reinforcement learning is guided by medical correctness rather than mere consensus, effectively distilling search-based intelligence into the model's parametric memory. Extensive evaluations on four benchmarks demonstrate that our method consistently and significantly outperforms current TTRL and standalone PRM selection. Our findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.

Paper Structure (11 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of the MAPLE framework. Given a test question, the policy model generates $M$ candidate reasoning chains via multi-sample generation. Each chain is evaluated by a PRM, which assigns step-level scores $s_{i,j}$ to every intermediate reasoning step. The scored candidates are aggregated through a Self-Consistency with Reward Model reranking (SC+RM) mechanism to produce a pseudo label $\hat{y}$. Per-sample rewards $R(y_i, \hat{y})$ are then computed by comparing each candidate with the pseudo label, and the resulting reward signals are used to update the policy model online via policy optimization.
  • Figure 2: Performance comparison across four medical QA benchmarks. Built on a Llama3.1 (8B) backbone, MAPLE consistently outperforms its base model and surpasses QwQ (32B) on DDXPlus and MMLU-Med despite being 4× smaller in model size. Italic green values indicate MAPLE's absolute accuracy gain over the Llama3.1 (MV) backbone.
  • Figure 3: Test-time scaling curves on MedMCQA under MV, BoM, and SC+RM. MAPLE (red) consistently outperforms the Llama3.1-8B backbone (blue) across different rollout budgets $M$, with the shaded region highlighting the performance gap.
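
The pseudo-labeling loop described in the Figure 1 caption can be illustrated with a small sketch. The paper's exact equations are not reproduced in this section, so several choices below are assumptions made only for illustration: the step-level scores $s_{i,j}$ are collapsed to one chain score by taking their minimum, SC+RM is modeled as a score-weighted vote over final answers, and the reward $R(y_i, \hat{y})$ is taken to be 1 when a candidate's answer matches the pseudo label and 0 otherwise. All function names and the toy scores are hypothetical.

```python
from collections import Counter, defaultdict

def chain_score(step_scores):
    # Assumption: a chain is only as reliable as its weakest step,
    # so the step-level PRM scores are aggregated with min().
    return min(step_scores)

def majority_vote(candidates):
    # Plain majority voting (MV): the most frequent final answer wins.
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]

def sc_rm_pseudo_label(candidates):
    # Assumed form of SC+RM: each candidate votes for its answer with a
    # weight equal to its aggregated PRM score; the heaviest answer wins.
    totals = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += chain_score(step_scores)
    return max(totals, key=totals.get)

def per_sample_rewards(candidates, pseudo_label):
    # Assumed binary reward: 1.0 if the candidate agrees with the
    # pseudo label, 0.0 otherwise.
    return [1.0 if answer == pseudo_label else 0.0 for answer, _ in candidates]

# Toy rollout with M = 5 candidates: (final answer, step-level PRM scores).
candidates = [
    ("B", [0.4, 0.3, 0.2]),
    ("B", [0.5, 0.2, 0.3]),
    ("B", [0.3, 0.3, 0.1]),
    ("A", [0.9, 0.8, 0.9]),
    ("A", [0.9, 0.7, 0.8]),
]

mv_label = majority_vote(candidates)          # frequency only
pseudo_label = sc_rm_pseudo_label(candidates)  # frequency weighted by PRM score
rewards = per_sample_rewards(candidates, pseudo_label)
```

In this toy rollout the most frequent answer ("B") is backed only by low-scoring chains, so majority voting and the score-weighted vote disagree: `mv_label` is "B" while `pseudo_label` is "A". This is the situation the abstract points to, where the most frequent reasoning path is not necessarily the correct one. In a full system the `rewards` list would feed a policy-optimization update, which is omitted here.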