Baichuan-M2: Scaling Medical Capability with Large Verifier System

Baichuan-M2 Team: Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang

TL;DR

The paper addresses the mismatch between medical LLM benchmarks and real-world clinical decision-making by introducing a dynamic, interactive verifier system consisting of a Patient Simulator and a Clinical Rubrics Generator. Baichuan-M2, a 32B medical augmented reasoning model, is trained via a multi-stage reinforcement learning framework with an enhanced GRPO algorithm to align reasoning with expert clinical judgment. Evaluations on HealthBench and Chinese clinical settings show Baichuan-M2 achieving state-of-the-art or near-state-of-the-art performance at a fraction of the parameter size of many competitors, highlighting the value of robust dynamic verification for practical deployment. The work demonstrates that deploying an interactive verification loop can produce clinically capable LLMs at deployable scales, offering a path toward safer and more effective AI-assisted medical decision-making.

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a level previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
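For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage at the core of the standard algorithm: several responses are sampled for the same prompt, and each response's reward is standardized against the mean and standard deviation of its own group. This is only the baseline formulation, not the improved variant the abstract refers to, whose details are not given here. The function name, the `eps` constant, and the reward values (hypothetical stand-ins for verifier scores) are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Standard GRPO baseline only; `eps` is an assumed constant that guards
    against division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier scores for four sampled responses to one prompt.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. Because the baseline comes from the group itself, no separate value network is needed.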

Paper Structure

This paper contains 35 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Verifier System Framework
  • Figure 2: An illustration of Patient Simulator. The system is composed of three primary modules: the Termination Gate, the Affective Unit, and the Factual Unit. The Affective Unit was trained using synthetic data to simulate patients with a wide range of personalities and sociocultural backgrounds. Both the Affective Unit and the Factual Unit were implemented via LLMs. These units employ a non-thinking model to quickly determine termination conditions and verify factual information.
  • Figure 3: Patient Simulator Comparison. We observe that the Privacy Score and Fact Score of DeepSeek-V3 decrease significantly once psychological information is incorporated. This indicates that using this model in evaluations may introduce substantial fluctuations in experimental results due to excessive stochastic noise. In contrast, our proposed simulator raises the Personification Score while keeping both the Privacy Score and the Fact Score stable.
  • Figure 4: Overview of Training Pipeline.
  • Figure 5: Impact of length penalty. The results demonstrate that the model can effectively compress response length (right) while its performance (left) continues to grow. All results are evaluated on a random subset of HealthBench.
  • ...and 8 more figures