Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

Qichen Ye, Junling Liu, Dading Chong, Peilin Zhou, Yining Hua, Fenglin Liu, Meng Cao, Ziming Wang, Xuxin Cheng, Zhu Lei, Zhenhua Guo

TL;DR

Qilin-Med addresses the problem of adapting LLMs to the medical domain without the prohibitive cost of pre-training from scratch or the complexity of RLHF. It introduces a three-stage training pipeline (Domain-specific Continued Pre-training, Supervised Fine-Tuning, and Direct Preference Optimization), supplemented by Retrieval Augmented Generation and built on the ChiMed Chinese medical dataset. Across the CMExam, CEval, and Huatuo-26M benchmarks, each stage yields progressive improvements in medical knowledge, instruction following, and preference alignment, with RAG providing additional gains. The work demonstrates an effective, scalable path to injecting medical expertise into Chinese LLMs while highlighting practical considerations for safety, evaluation, and deployment in research contexts.

Abstract

Integrating large language models (LLMs) into healthcare holds great potential but faces challenges. Pre-training LLMs from scratch for domains like medicine is resource-heavy and often infeasible. On the other hand, sole reliance on Supervised Fine-tuning (SFT) can result in overconfident predictions and may not tap into domain-specific insights. In response, we present a multi-stage training method combining Domain-specific Continued Pre-training (DCPT), SFT, and Direct Preference Optimization (DPO). In addition, we publish a 3 GB Chinese Medicine (ChiMed) dataset, encompassing medical question answering, plain texts, knowledge graphs, and dialogues, segmented into three training stages. The medical LLM trained with our pipeline, Qilin-Med, shows substantial performance improvement. In the continued pre-training (CPT) and SFT phases, Qilin-Med achieved 38.4% and 40.0% accuracy on the CMExam test set, respectively. It outperformed the base model Baichuan-7B (accuracy: 33.5%) by 7.5%. In the DPO phase, it scored 16.66 in BLEU-1 and 27.44 in ROUGE-1 on the Huatuo-26M test set, improving on the SFT phase (12.69 in BLEU-1 and 24.21 in ROUGE-1). Additionally, we have further enhanced the model's performance through the Retrieval Augmented Generation (RAG) approach. Experiments demonstrate that Qilin-Med-RAG achieves an accuracy of 42.8% on CMExam. These results highlight the contribution of our novel training approach in building LLMs for medical applications.
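
The third stage named in the abstract, Direct Preference Optimization, can be illustrated with a minimal sketch of the standard DPO loss for a single preference pair. This is not taken from the Qilin-Med code: the function names, the example log-probabilities, and the default `beta=0.1` are assumptions chosen for illustration, and a real training run would compute sequence log-probabilities from the policy and a frozen reference model over batches.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Standard DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen response over the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities, for illustration only.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy shifted toward the chosen answer
print(baseline, improved)
```

When the policy matches the reference, the loss equals log 2; it falls below that value once the policy assigns relatively more probability to the preferred response, which is the behavior a preference-alignment stage optimizes for.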

Paper Structure

This paper contains 25 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Experimental results of our proposed Qilin-Med-7B-CPT, Qilin-Med-7B-SFT, and Qilin-Med-7B-DPO, which demonstrate superior performance on both reasoning and prediction tasks.
  • Figure 2: The construction pipeline of Qilin-Med. Stage 1 performs domain-specific continued pre-training to strengthen fundamental medical knowledge; Stage 2 applies instruction-based supervised fine-tuning to elicit the model's interpretive and responsive capabilities; Stage 3 aligns the model's output with human preferences.
  • Figure 3: A conversation example from Huatuo-26M dialogue. Compared to Baichuan-7B, Qilin-Med-7B with CPT, SFT, and DPO generated more relevant and informative responses.
  • Figure 4: A conversational case on the CMExam dataset. Compared to LLaMA, ChatGLM, and GPT-4, Qilin-Med-7B-CPT and Qilin-Med-7B-SFT generated more relevant and informative responses.