Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR

This work introduces MedR-Bench, a real-world clinical reasoning benchmark of 1,453 structured patient cases spanning 13 body systems and 10 specialties, including a rare-disease subset. It pairs this dataset with a three-stage evaluation framework (examination recommendation, diagnostic decision-making, and treatment planning) and a Reasoning Evaluator that automatically scores the efficiency, factuality, and completeness of free-text reasoning against medical knowledge and the ground-truth references for each case. Across five reasoning LLMs, diagnostic performance is strong when sufficient information is available, but clear gaps remain in examination recommendation and treatment planning, especially for rare diseases; open-source models like DeepSeek-R1 are narrowing the gap with proprietary systems. The study highlights both the progress and the limitations of clinical reasoning in LLMs, provides open data and tooling, and calls for continued human oversight to ensure safe and reliable real-world deployment.

Abstract

Recent advancements in reasoning-enhanced large language models (LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated significant progress. However, their application in professional medical contexts remains underexplored, particularly in evaluating the quality of their reasoning processes alongside final outputs. Here, we introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references derived from clinical case reports. Spanning 13 body systems and 10 specialties, it includes both common and rare diseases. To comprehensively evaluate LLM performance, we propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. To assess reasoning quality, we present the Reasoning Evaluator, a novel automated system that objectively scores free-text reasoning responses based on efficiency, factuality, and completeness using dynamic cross-referencing and evidence checks. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking. Our results show that current LLMs achieve over 85% accuracy in relatively simple diagnostic tasks when provided with sufficient examination results. However, performance declines in more complex tasks, such as examination recommendation and treatment planning. While reasoning outputs are generally reliable, with factuality scores exceeding 90%, critical reasoning steps are frequently missed. These findings underscore both the progress and limitations of clinical LLMs. Notably, open-source models like DeepSeek-R1 are narrowing the gap with proprietary systems, highlighting their potential to drive accessible and equitable advancements in healthcare.
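
To make the three reasoning metrics concrete, the following is a minimal sketch of how per-step judgments could be aggregated into efficiency, factuality, and completeness scores. The abstract does not give the formulas, so the definitions here are assumptions: efficiency is taken as the share of reasoning steps that add new insight, factuality as the share of those effective steps judged correct, and completeness as the share of reference reasoning steps that the response covers. The names `JudgedStep`, `efficiency`, `factuality`, and `completeness` are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedStep:
    """One model reasoning step with judge labels (hypothetical schema)."""
    effective: bool  # assumed: the step adds new insight rather than repeating
    correct: bool    # assumed: the step was verified against medical evidence


def efficiency(steps: List[JudgedStep]) -> float:
    """Share of all steps that are effective (assumed definition)."""
    if not steps:
        return 0.0
    return sum(s.effective for s in steps) / len(steps)


def factuality(steps: List[JudgedStep]) -> float:
    """Share of effective steps that are correct (assumed definition)."""
    effective_steps = [s for s in steps if s.effective]
    if not effective_steps:
        return 0.0
    return sum(s.correct for s in effective_steps) / len(effective_steps)


def completeness(reference_covered: List[bool]) -> float:
    """Share of reference reasoning steps covered by the response (assumed)."""
    if not reference_covered:
        return 0.0
    return sum(reference_covered) / len(reference_covered)


# Illustrative labels only; these are not results from the benchmark.
steps = [
    JudgedStep(effective=True, correct=True),
    JudgedStep(effective=True, correct=False),
    JudgedStep(effective=False, correct=True),
    JudgedStep(effective=True, correct=True),
]
reference_covered = [True, False, True, True, False]

scores = {
    "efficiency": efficiency(steps),
    "factuality": factuality(steps),
    "completeness": completeness(reference_covered),
}
print(scores)
```

Under these assumed definitions, a response can score high on factuality while scoring low on completeness, which matches the pattern the abstract reports: reasoning steps are mostly correct, yet critical steps are often missing.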

Paper Structure

This paper contains 27 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overview of our main evaluation pipeline and results. a illustrates our evaluation framework across three critical patient stages. b presents the metrics for reasoning processes and final generations using our Reasoning Evaluator. c compares the performance of five LLMs on examination recommendation, diagnostic decision-making, and treatment planning. Notably, for treatment planning, we include a comparison on rare disease cases. For the other settings, the rare disease results show minimal variation compared to all cases, so we omit them here and provide them in the extended tables. d compares the quality of the reasoning processes, with results for rare cases also provided in the supplementary material. For examination recommendation, 1-turn reasoning results are plotted, and for diagnostic decision-making, oracle reasoning results are plotted.
  • Figure 1: Overview of our evaluation settings. We consider three stages: examination recommendation, diagnosis decision-making, and treatment planning. a, b illustrate the 1-turn and free-turn interaction pipelines for examination recommendation. c, d, e depict the evaluation cases for diagnosis decision-making in 1-turn, free-turn, and oracle settings. Finally, f presents the treatment planning task in the oracle setting.
  • Figure 1: Case 1. A case of 1-turn examination recommendation and diagnostic decision-making. The meaning of the row headers is explained at the beginning of Supplementary \ref{case_study}.
  • Figure 2: Overview of our data curation pipeline, Reasoning Evaluator, and final patient case distributions. a illustrates our data curation pipeline using a flowchart. We start with the original case reports from the PMC-OA subset, then filter and reorganize them into structured patient cases for testing. b depicts our Reasoning Evaluator, which quantitatively measures reasoning quality in terms of efficiency, factuality, and completeness. External search engines are employed to help the agent evaluate the correctness of the provided reasoning steps more accurately. c presents the distribution of patient cases across different medical aspects.
  • Figure 2: Case 2. A case of oracle diagnosis on a common disease. The meaning of the row headers is explained at the beginning of Supplementary \ref{case_study}.
  • ...and 3 more figures