Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

Baichuan-M3 Team: Chengfeng Dou, Fan Yang, Fei Li, Jiyuan Jia, Qiang Ju, Shuai Wang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Hongda Zhang, Jinyang Tai, Linzhuang Sun, Peidong Guo, Yichuan Mo, Xiaochuan Wang, Hengfu Cui, Zhishou Zhang

TL;DR

Baichuan-M3 tackles the gap between conversational QA and clinical decision support by modeling the full clinical workflow within a medical-enhanced LLM. It introduces a three-stage training framework—Task RL, offline policy distillation, and multi-teacher online distillation—coupled with Segmented Pipeline RL to address long-horizon reasoning and information gathering. The system integrates dynamic rubric evolution and fact-aware reinforcement learning to suppress hallucinations while preserving diagnostic reasoning, achieving state-of-the-art results on HealthBench, HealthBench-Hallu, and ScanBench. The work demonstrates practical impact through improved clinical inquiry, diagnostic accuracy, and safety, with public release and inference optimizations to support deployment in real-world healthcare settings.

Abstract

We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety. The models are publicly available at https://huggingface.co/collections/baichuan-inc/baichuan-m3.

Paper Structure

This paper contains 63 sections, 15 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: An illustration of the three-stage pipeline.
  • Figure 2: Segmented Pipeline RL (left) and Policy Learning Algorithm (right).
  • Figure 3: Fact-Aware Reinforcement Learning Algorithm.
  • Figure 4: Overall performance comparison on ScanBench.
  • Figure 5: Detailed breakdown of Inquiry Capabilities.
  • ...and 7 more figures