Baichuan-M1: Pushing the Medical Capability of Large Language Models

Bingning Wang, Haizhou Zhao, Huozhi Zhou, Liang Song, Mingyu Xu, Wei Cheng, Xiangrong Zeng, Yupeng Zhang, Yuqi Huo, Zecheng Wang, Zhengyun Zhao, Da Pan, Fei Kou, Fei Li, Fuzhong Chen, Guosheng Dong, Han Liu, Hongda Zhang, Jin He, Jinjie Yang, Kangxi Wu, Kegeng Wu, Lei Su, Linlin Niu, Linzhuang Sun, Mang Wang, Pengcheng Fan, Qianli Shen, Rihui Xin, Shunya Dang, Songchi Zhou, Weipeng Chen, Wenjing Luo, Xin Chen, Xin Men, Xionghai Lin, Xuezhen Dong, Yan Zhang, Yifei Duan, Yuyan Zhou, Zhi Ma, Zhiying Wu

TL;DR

Baichuan-M1 addresses the need for effective medical-domain LLMs by training from scratch on 20T tokens with a dedicated medical curriculum and a hybrid architecture that combines global and sliding-window attention. It combines three-stage pretraining, extensive data curation (including over 1T tokens from expert sources), and synthetic medical data, then applies SFT and RL-based alignment (ELO, TDPO, PPO) to strengthen medical reasoning and safety. On open benchmarks, Baichuan-M1-14B-Instruct outperforms strong open baselines and narrows the gap to leading proprietary models on medical tasks, while preserving performance on general domains. The work demonstrates that domain-specific pretraining from scratch is practical and provides an open resource for the research community to advance AI-assisted healthcare.

Abstract

The current generation of large language models (LLMs) is typically designed for broad, general-purpose applications, while domain-specific LLMs, especially in vertical fields like medicine, remain relatively scarce. In particular, the development of highly efficient and practical LLMs for the medical domain is challenging due to the complexity of medical knowledge and the limited availability of high-quality data. To bridge this gap, we introduce Baichuan-M1, a series of large language models specifically optimized for medical applications. Unlike traditional approaches that simply continue pretraining on existing models or apply post-training to a general base model, Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical capabilities. Our model is trained on 20 trillion tokens and incorporates a range of effective training methods that strike a balance between general capabilities and medical expertise. As a result, Baichuan-M1 not only performs strongly across general domains such as mathematics and coding but also excels in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini version of our model, which can be accessed through the following links.

Paper Structure

This paper contains 49 sections, 7 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: The medical capability of Baichuan-M1-14B compared with other models.
  • Figure 6: The data synthesis pipeline for various sources.
  • Figure 7: KV cache comparison between Baichuan-Med-14B and other models. With a short context, the KV cache of Baichuan-Med-14B is approximately equal to that of GQA with 6 KV heads; with a long context, it is approximately equal to that of GQA with 4 KV heads.
  • Figure 8: The attention mechanism used by Baichuan-M1-14B.
  • Figure 9: Tokenization efficiency of different models (lower is better).
  • ...and 3 more figures
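
The Figure 7 caption says the model's KV cache matches a larger GQA configuration at short context and a smaller one at long context. A minimal sketch of why a mix of global and sliding-window layers could behave this way is given below. The layer layout, head counts, and window size (`HYPOTHETICAL_LAYERS`) are illustrative assumptions chosen so the numbers land near the caption's values; they are not the published model configuration, and the function names are hypothetical.

```python
def kv_cache_entries(context_len, layers):
    """Count cached key and value vectors across all layers.

    layers: list of (kv_heads, window) pairs; window=None means global attention.
    The head dimension is omitted because it cancels in the ratio below.
    """
    total = 0
    for kv_heads, window in layers:
        # A sliding-window layer never caches more than `window` positions.
        cached_positions = context_len if window is None else min(context_len, window)
        total += 2 * kv_heads * cached_positions  # 2 = keys + values
    return total


def equivalent_gqa_heads(context_len, layers):
    """KV heads a uniform all-global GQA model with the same layer count
    would need to have the same cache size at this context length."""
    per_head_uniform = 2 * context_len * len(layers)
    return kv_cache_entries(context_len, layers) / per_head_uniform


# ASSUMPTION: alternating global layers (8 KV heads) and sliding-window
# layers (4 KV heads, window 1024). Purely illustrative values.
HYPOTHETICAL_LAYERS = [(8, None), (4, 1024)] * 4

if __name__ == "__main__":
    for n in (512, 1024, 8192, 1_048_576):
        print(n, round(equivalent_gqa_heads(n, HYPOTHETICAL_LAYERS), 3))
```

Under these assumptions, while the context fits inside the window every layer caches the full context, so the equivalent head count is the plain average over layers. As the context grows, the sliding-window layers stop contributing and the figure drifts toward the global layers' share alone.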