DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang

TL;DR

DiagnosisArena is a benchmark of 1,113 segmented patient cases paired with their diagnoses, drawn from clinical case reports published in 10 top-tier medical journals and spanning 28 medical specialties. It is built through multiple rounds of screening and review by AI systems and human experts, with checks to prevent data leakage. The strongest reasoning models evaluated, o3, o1, and DeepSeek-R1, reach only 51.12%, 31.09%, and 17.79% accuracy, respectively, which points to a significant generalization bottleneck in clinical diagnostic reasoning.

Abstract

The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, the diagnostic capabilities of current models urgently need to be benchmarked systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties and derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.

Paper Structure

This paper contains 22 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Performance of SOTA models on DiagnosisArena and other benchmarks.
  • Figure 2: Overview of the DiagnosisArena Benchmark. (a) The pipeline for constructing the DiagnosisArena dataset consists of four stages: data collection from the journals, data structuring, iterative filtering of non-reasoning examples, and expert-AI collaborative verification. (b) DiagnosisArena is sourced from 10 top-tier medical journals. (c) DiagnosisArena is highly diverse, covering 28 medical specialties. (d) DiagnosisArena boasts clearly defined segments and offers information-dense clinical cases, which align more closely with clinical practice and present greater reasoning complexity.
  • Figure 3: Performance of Different Models on DiagnosisArena. (a) The Top-$k$ metric represents the proportion of cases where the correct answer is included among the Top-$k$ predictions generated by the model, ranked in descending order of confidence. The results reveal that while o3 outperforms the other models, DiagnosisArena remains a significant challenge for all existing models. (b) MCQ denotes the multiple-choice version of DiagnosisArena. A marked increase in model performance can be observed, with o1 reaching 61.90%.
  • Figure 4: Leakage Detection on DiagnosisArena. (a) Pre-experiment small sample Leakage Detection. For all models, the experimental results maintained a generally consistent trend across different years, with only minor fluctuations. (b) Leakage Detection on the Constructed DiagnosisArena. Over the past decade, all models have demonstrated relatively stable accuracy, with no significant fluctuations over time.
  • Figure 5: A Case Study of DiagnosisArena. Except for o3-mini, which successfully provided the correct answer in the top 1, the other models were far from the correct answer. Analyzing DeepSeek-R1's response, we found that, despite numerous indirect pieces of evidence supporting the diagnosis of AMVT, DeepSeek-R1 selectively ignored these clues and instead overly relied on the reasoning paths of common diseases. DeepSeek-R1's response is shown in the appendix.
  • ...and 2 more figures
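
The Top-$k$ metric defined in the Figure 3 caption can be illustrated with a short sketch. This is not the authors' evaluation code: the function names and the matching rule are assumptions made for illustration. In particular, the default `exact_match` comparison (case-insensitive string equality) is a stand-in, since deciding whether a free-text diagnosis agrees with the reference generally needs a more careful judgment; any other predicate can be passed as `is_match`.

```python
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Hypothetical matching rule: normalized string equality (an assumption, not the paper's rule)."""
    return normalize(prediction) == normalize(reference)


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
    is_match: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of cases whose reference diagnosis appears among the first k predictions.

    ranked_predictions[i] lists the model's diagnoses for case i,
    ordered from most to least confident.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(references):
        raise ValueError("need exactly one reference per case")
    if not references:
        return 0.0
    hits = sum(
        any(is_match(pred, ref) for pred in preds[:k])
        for preds, ref in zip(ranked_predictions, references)
    )
    return hits / len(references)
```

Because a hit within the first $k$ predictions is also a hit within the first $k+1$, the value can only stay the same or grow as $k$ increases, which is why Top-$k$ curves are non-decreasing in $k$.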