ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
TL;DR
This paper tackles the challenge of applying deep reasoning to clinical diagnosis with large language models by introducing ClinicalGPT-R1, a reasoning-enhanced generalist LLM for multilingual medical tasks. It uses a two-stage training strategy (SFT followed by PPO-based RL) and builds a large synthetic corpus of long-chain reasoning data to strengthen diagnostic reasoning. The model is evaluated on MedBench-Hard, a 3,500-case benchmark spanning seven departments. Results show that ClinicalGPT-R1 outperforms GPT-4o on Chinese diagnostic tasks and is competitive with GPT-4o in English, indicating that its diagnostic reasoning transfers across languages. The authors also contribute an evaluation framework and public resources to support wider use of clinical reasoning models in multilingual healthcare settings.
Abstract
Recent large language models (LLMs) have shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves performance comparable to GPT-4 in English settings. This comparison supports the strong performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.
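As an illustration of how results on a multi-specialty benchmark such as MedBench-Hard might be summarized, the sketch below computes diagnostic accuracy per specialty. It is not the authors' evaluation code: the record schema (`specialty`, `gold`, `predicted`), the case-insensitive exact-match scoring rule, and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_specialty(records):
    """Return {specialty: accuracy} for an iterable of case records.

    Each record is a dict with hypothetical keys 'specialty', 'gold',
    and 'predicted'. Scoring is case-insensitive exact match, which is
    an assumption; the paper may use a different criterion.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        specialty = record["specialty"]
        total[specialty] += 1
        gold = record["gold"].strip().lower()
        predicted = record["predicted"].strip().lower()
        if predicted == gold:
            correct[specialty] += 1
    return {s: correct[s] / total[s] for s in total}


# Toy records for illustration only; not taken from MedBench-Hard.
toy_records = [
    {"specialty": "cardiology", "gold": "Atrial fibrillation", "predicted": "atrial fibrillation"},
    {"specialty": "cardiology", "gold": "Myocarditis", "predicted": "Pericarditis"},
    {"specialty": "neurology", "gold": "Migraine", "predicted": "Migraine"},
]

scores = accuracy_by_specialty(toy_records)
for specialty, acc in sorted(scores.items()):
    print(f"{specialty}: {acc:.2f}")
```

Reporting accuracy per specialty rather than a single pooled number makes it easier to see whether a model's advantage holds across departments or is concentrated in a few of them.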
