MSDiagnosis: A Benchmark for Evaluating Large Language Models in Multi-Step Clinical Diagnosis
Ruihui Hou, Shencheng Chen, Yongqi Fan, Guangya Yu, Lifeng Zhu, Jing Sun, Jingping Liu, Tong Ruan
TL;DR
MSDiagnosis addresses the gap between real-world multi-step clinical diagnosis and existing single-step benchmarks by introducing a Chinese EMR-based benchmark of 2,225 cases across 12 departments. Its framework couples a retrieval-enhanced forward inference stage with a backward inference, reflection, and refinement stage, so that the LLM evaluates and improves its own diagnostic results. The paper describes a rigorous process for data collection, annotation, and key-point construction, reports extensive experiments on open- and closed-source LLMs, and shows that the proposed framework yields consistent gains in final-diagnosis quality and interpretability, although overall performance still falls short of human experts. The benchmark and framework are practically useful for evaluating and guiding LLM-driven diagnostic reasoning in realistic, multi-step clinical settings.
Abstract
Clinical diagnosis is critical in medical practice and typically requires a continuous, evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which do not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a Chinese clinical diagnostic benchmark called MSDiagnosis. The benchmark consists of 2,225 cases from 12 departments and covers primary diagnosis, differential diagnosis, and final diagnosis. Additionally, we propose a novel and effective framework that combines forward inference, backward inference, reflection, and refinement, enabling the large language model to self-evaluate and adjust its diagnostic results. We evaluate open-source models, closed-source models, and our proposed framework. The experimental results demonstrate the effectiveness of the proposed method. We also provide a comprehensive experimental analysis and suggest future research directions for this task.
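The four stages named in the abstract can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function `diagnose`, the `llm` and `retrieve` callables, the prompt wording, the bracketed stage tags, the `CONSISTENT`/`INCONSISTENT` verdict convention, and the `max_rounds` cap are all hypothetical names introduced here.

```python
from typing import Callable, List

# Hypothetical interfaces: an LLM is any function from prompt text to reply text,
# and a retriever maps an EMR to a list of similar reference cases.
LLM = Callable[[str], str]
Retriever = Callable[[str], List[str]]


def diagnose(emr: str, llm: LLM, retrieve: Retriever, max_rounds: int = 2) -> str:
    """Illustrative forward -> backward -> reflect -> refine loop (assumed design)."""
    # Forward inference: draft a diagnosis from the EMR plus retrieved cases.
    examples = "\n".join(retrieve(emr))
    diagnosis = llm(
        f"[forward] Similar cases:\n{examples}\n\nEMR:\n{emr}\n\nGive the diagnosis."
    )

    for _ in range(max_rounds):
        # Backward inference: ask which EMR findings support the current diagnosis.
        evidence = llm(
            f"[backward] EMR:\n{emr}\n\nDiagnosis:\n{diagnosis}\n\n"
            "List the findings in the EMR that support this diagnosis."
        )
        # Reflection: judge whether that evidence actually justifies the diagnosis.
        verdict = llm(
            f"[reflect] Diagnosis:\n{diagnosis}\n\nEvidence:\n{evidence}\n\n"
            "Answer CONSISTENT or INCONSISTENT, then explain."
        )
        if verdict.strip().upper().startswith("CONSISTENT"):
            break  # the model accepts its own diagnosis; stop early
        # Refinement: revise the diagnosis using the reflection feedback.
        diagnosis = llm(
            f"[refine] EMR:\n{emr}\n\nPrevious diagnosis:\n{diagnosis}\n\n"
            f"Critique:\n{verdict}\n\nGive a revised diagnosis."
        )
    return diagnosis
```

Passing the model and retriever as plain callables keeps the sketch independent of any particular LLM API; a real system would also need structured outputs rather than string prefixes to read the reflection verdict reliably.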
