Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support

Guoxin Wang, Minyu Gao, Shuai Yang, Ya Zhang, Lizhi He, Liang Huang, Hanlin Xiao, Yexuan Zhang, Wanyue Li, Lu Chen, Jintao Fei, Xin Li

TL;DR

This work introduces Citrus, a medical language model that emulates expert clinicians’ cognitive pathways to enhance disease reasoning and clinical decision support. It uses a multi-stage training pipeline: continuous pre-training (CPT) for pattern recognition, supervised fine-tuning (SFT) for structured reasoning, and reinforcement learning (RL) for alignment. The model is trained on a large corpus of simulated expert reasoning data. The authors release the last-stage training data and a new real-world clinical benchmark, the JDH Medical Practice Benchmark (JMED), which is designed to reflect real-world disease distribution and clinical ambiguity. They report superior performance for Citrus on medical benchmarks (e.g., MedQA, PubMedQA) compared to 70B-scale and larger baselines. The work highlights the potential of expert-pathway data to improve medical decision support while stressing careful evaluation and ethical use, given the high stakes in clinical contexts.

Abstract

Large language models (LLMs), particularly those with reasoning capabilities, have rapidly advanced in recent years, demonstrating significant potential across a wide range of applications. However, their deployment in healthcare, especially in disease reasoning tasks, is hindered by the challenge of acquiring expert-level cognitive data. In this paper, we introduce Citrus, a medical language model that bridges the gap between clinical expertise and AI reasoning by emulating the cognitive processes of medical experts. The model is trained on a large corpus of simulated expert disease reasoning data, synthesized using a novel approach that accurately captures the decision-making pathways of clinicians. This approach enables Citrus to better simulate the complex reasoning processes involved in diagnosing and treating medical conditions. To further address the lack of publicly available datasets for medical reasoning tasks, we release the last-stage training data, including a custom-built medical diagnostic dialogue dataset. This open-source contribution aims to support further research and development in the field. Evaluations using authoritative benchmarks such as MedQA, covering tasks in medical reasoning and language understanding, show that Citrus achieves superior performance compared to other models of similar size. These results highlight Citrus’s potential to significantly enhance medical decision support systems, providing a more accurate and efficient tool for clinical decision-making.

Paper Structure

This paper contains 61 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Citrus ranks high on several authoritative medical benchmarks, compared with two widely used LLMs, GPT-4o and Claude, and a powerful 70B-scale LLM distilled from DeepSeek-R1.
  • Figure 2: LLMs follow cognitive pathways similar to those of medical experts. CPT enables LLMs to learn medical knowledge and perform pattern recognition as doctors do, while LLMs can also handle hypothetico-deductive reasoning by executing several specific reasoning steps, which can be trained through the SFT and RL procedures.
  • Figure 3: Overview of the training stages and training data pipeline. The training process consists of three stages: CPT, SFT, and RL. We show the training purpose and dataset scale for each stage, and indicate the data pipeline used in the corresponding stage.
  • Figure 4: Token distribution statistics of the stage 3 SFT training data. The data are designed and constructed to simulate the long chain-of-thought (CoT) reasoning process of medical experts. The average length is 656 tokens, and the upper bound is around 4k tokens.
  • Figure 5: The construction framework of our dataset, JMED, is illustrated with arrows indicating the type of data used as input for the LLMs and the corresponding response obtained at each step. We begin by consolidating the dialogue data into EMRs, then transform them into the format of medical examination questions, and finally, through option expansion and quality checks, obtain our dataset.
  • ...and 6 more figures