MSDiagnosis: A Benchmark for Evaluating Large Language Models in Multi-Step Clinical Diagnosis

Ruihui Hou; Shencheng Chen; Yongqi Fan; Guangya Yu; Lifeng Zhu; Jing Sun; Jingping Liu; Tong Ruan

TL;DR

MSDiagnosis addresses the gap between real-world multi-step clinical diagnosis and existing single-step benchmarks by introducing a Chinese benchmark built from electronic medical records (EMRs), with 2,225 cases across 12 departments. The accompanying framework couples a retrieval-enhanced forward inference stage with a backward inference, reflection, and refinement stage, so that the LLM can evaluate and revise its own diagnostic results. The paper describes the data collection, annotation, and key-points procedure, reports experiments on open- and closed-source LLMs, and shows that the proposed framework yields consistent gains in final-diagnosis quality and interpretability, although performance still falls short of human experts. The benchmark and framework are intended for evaluating and guiding LLM-driven diagnostic reasoning in realistic, multi-step clinical settings.

Abstract

Clinical diagnosis is critical in medical practice and typically requires a continuous, evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which do not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a Chinese clinical diagnostic benchmark called MSDiagnosis. The benchmark consists of 2,225 cases from 12 departments and covers primary diagnosis, differential diagnosis, and final diagnosis. Additionally, we propose a novel and effective framework that combines forward inference, backward inference, reflection, and refinement, enabling the large language model to self-evaluate and adjust its diagnostic results. We evaluate open-source models, closed-source models, and our proposed framework on this benchmark. The experimental results demonstrate the effectiveness of the proposed method. We also provide a comprehensive experimental analysis and suggest future research directions for this task.
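
The two-stage flow described in the abstract (forward inference, then backward inference, reflection, and refinement) can be illustrated with a minimal sketch. This is not the authors' implementation: the `LLM` and `Retriever` callables, the prompt wording, the "OK" stopping signal, and the `max_rounds` parameter are all assumptions introduced here for illustration only.

```python
from typing import Callable, List

# Hypothetical interfaces, not from the paper:
LLM = Callable[[str], str]              # prompt in, generated text out
Retriever = Callable[[str], List[str]]  # EMR text in, similar cases out


def forward_inference(emr: str, llm: LLM, retrieve: Retriever) -> str:
    """Stage 1: draft a diagnosis from the EMR plus retrieved similar cases."""
    cases = "\n".join(retrieve(emr))
    return llm(f"[forward]\nSimilar cases:\n{cases}\nEMR:\n{emr}\nDiagnosis:")


def backward_reflect_refine(emr: str, draft: str, llm: LLM,
                            max_rounds: int = 2) -> str:
    """Stage 2: check the draft against the EMR and revise it if the check fails."""
    diagnosis = draft
    for _ in range(max_rounds):
        # Backward inference and reflection: does the EMR support the diagnosis?
        critique = llm(
            f"[reflect]\nEMR:\n{emr}\nDiagnosis:\n{diagnosis}\nIssues (or OK):"
        )
        if critique.strip().upper() == "OK":  # assumed stopping signal
            break
        # Refinement: rewrite the diagnosis in light of the critique.
        diagnosis = llm(
            f"[refine]\nEMR:\n{emr}\nDiagnosis:\n{diagnosis}\n"
            f"Issues:\n{critique}\nRevised:"
        )
    return diagnosis


def diagnose(emr: str, llm: LLM, retrieve: Retriever, max_rounds: int = 2) -> str:
    """Run both stages in sequence and return the final diagnosis text."""
    draft = forward_inference(emr, llm, retrieve)
    return backward_reflect_refine(emr, draft, llm, max_rounds)
```

Passing the model and retriever in as plain callables keeps the sketch independent of any particular LLM client or retrieval library, so the control flow can be exercised with stubs before a real model is plugged in.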

Paper Structure (38 sections, 1 equation, 9 figures, 10 tables)

Introduction
Problem Formulation
MSDiagnosis
Data Collection and Selection
Data Annotation
Question Construction
Answer Annotation
Key Points Annotation
Dataset Analysis
Data Statistics
Data Characteristics
Method
Framework
Forward Inference
Backward Inference and Reflection
...and 23 more sections

Figures (9)

Figure 1: An example of our diagnostic benchmark and its differences from the previous diagnostic benchmark.
Figure 2: Distribution of "Primary diagnosis matches final diagnosis" and "Primary diagnosis differs from final diagnosis" in MSDiagnosis across different departments.
Figure 3: Our framework for the multi-step clinical diagnosis. The top portion of the figure illustrates the flow of the framework, comprising two stages. The first stage involves the forward inference diagnosis. The second stage focuses on backward inference, reflection, and refinement.
Figure 4: Case study. The green (red) highlight indicates correct (incorrect) results. Purple marks lack of domain knowledge errors, blue marks symptom-disease confusion errors, and yellow marks diagnostic criteria inconsistent with facts.
Figure 5: The prompt of forward inference.
...and 4 more figures
