Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support
Guoxin Wang, Minyu Gao, Shuai Yang, Ya Zhang, Lizhi He, Liang Huang, Hanlin Xiao, Yexuan Zhang, Wanyue Li, Lu Chen, Jintao Fei, Xin Li
TL;DR
This work introduces Citrus, a medical language model that emulates expert clinicians’ cognitive pathways to enhance disease reasoning and clinical decision support. It uses a multi-stage training pipeline: continuous pre-training (CPT) for pattern recognition, supervised fine-tuning (SFT) for structured reasoning, and reinforcement learning (RL) for alignment. The model is trained on a large corpus of simulated expert reasoning data and evaluated in part on a new real-world clinical benchmark. The authors release Citrus’s training data and the JDH Medical Practice Benchmark (JMED), designed to reflect real-world disease distribution and clinical ambiguity, and demonstrate Citrus’s superior performance on medical benchmarks (e.g., MedQA, PubMedQA) compared to 70B-scale and larger baselines. The work highlights the potential of expert-pathway data to improve medical decision support while stressing careful evaluation and ethical use, given the high stakes in clinical contexts.
Abstract
Large language models (LLMs), particularly those with reasoning capabilities, have rapidly advanced in recent years, demonstrating significant potential across a wide range of applications. However, their deployment in healthcare, especially in disease reasoning tasks, is hindered by the challenge of acquiring expert-level cognitive data. In this paper, we introduce Citrus, a medical language model that bridges the gap between clinical expertise and AI reasoning by emulating the cognitive processes of medical experts. The model is trained on a large corpus of simulated expert disease reasoning data, synthesized using a novel approach that accurately captures the decision-making pathways of clinicians. This approach enables Citrus to better simulate the complex reasoning processes involved in diagnosing and treating medical conditions. To further address the lack of publicly available datasets for medical reasoning tasks, we release the last-stage training data, including a custom-built medical diagnostic dialogue dataset. This open-source contribution aims to support further research and development in the field. Evaluations using authoritative benchmarks such as MedQA, covering tasks in medical reasoning and language understanding, show that Citrus achieves superior performance compared to other models of similar size. These results highlight Citrus’s potential to significantly enhance medical decision support systems, providing a more accurate and efficient tool for clinical decision-making.
