Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li

TL;DR

PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. It also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians.

Abstract

We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
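
The Top@1 and Top@4 thresholds mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each case comes with a ranked list of candidate diagnoses and that a case counts as a hit when the reference diagnosis appears among the first k candidates under a case-insensitive exact string match. The function name `top_k_accuracy`, the matching rule, and the toy cases are hypothetical; the paper's actual adjudication procedure is not described in this excerpt.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    ground_truths: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is among the top-k candidates.

    Assumption: a hit is a case-insensitive exact string match. A real
    evaluation would likely need expert or ontology-based matching.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(ground_truths):
        raise ValueError("need exactly one ranked list per case")
    if not ground_truths:
        raise ValueError("need at least one case")

    hits = 0
    for candidates, truth in zip(ranked_predictions, ground_truths):
        # Only the first k ranked candidates are eligible to score a hit.
        top_k = {c.strip().casefold() for c in candidates[:k]}
        if truth.strip().casefold() in top_k:
            hits += 1
    return hits / len(ground_truths)


# Hypothetical toy data, not taken from the paper's benchmark.
toy_predictions = [
    ["Cushing disease", "ectopic ACTH syndrome"],
    ["primary aldosteronism", "pheochromocytoma"],
    ["Graves disease", "thyroiditis"],
    ["type 1 diabetes", "MODY"],
]
toy_truths = ["Cushing disease", "pheochromocytoma", "Graves disease", "insulinoma"]

if __name__ == "__main__":
    print(top_k_accuracy(toy_predictions, toy_truths, k=1))  # 0.5
    print(top_k_accuracy(toy_predictions, toy_truths, k=4))  # 0.75
```

Top@4 can never be lower than Top@1 on the same cases, because every Top@1 hit is also a Top@4 hit.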

Paper Structure (23 sections, 5 equations, 7 figures, 1 algorithm)

Figures (7)

  • Figure 1: Experiment settings and overall performance. Upper: The PULSE agent was evaluated against physicians across three experience strata (Senior, Junior, and Resident) on 82 endocrinology cases collected across seven subspecialty domains and stratified into three incidence categories based on disease prevalence. Bottom left: Diagnostic accuracy quantified by Top@1 (purple) and Top@4 (pink). Asterisks indicate statistical significance compared to PULSE (*$P<0.05$, **$P<0.01$, ***$P<0.001$, ****$P<1 \times 10^{-4}$; paired McNemar test, Holm-corrected). Bottom right: Breadth of diagnostic hypotheses (Search Space Width).
  • Figure 2: a, Comparison of diagnostic accuracy stratified by three disease incidence tiers (0.01–1%, 0.001–0.01%, and < 0.001%) for PULSE, the senior specialist, pooled junior specialists ($n=5$), and pooled residents ($n=5$). b, Top@1 and Top@4 accuracy of individual junior specialists plotted against their clinical experience in months. Points show observed accuracy, the shaded band indicates 95% Wilson confidence intervals, and the dashed line denotes the linear trend across individuals. c, Granular analysis of diagnostic accuracy across incidence tiers for individual junior specialists, annotated with their cumulative clinical experience in months.
  • Figure 3: a, Distributions of per-case diagnostic time for physicians (left y-axis; residents, junior specialists, senior specialists) and per-case model deliberation measured as output length in tokens (right y-axis; PULSE, gpt-oss-120B, Qwen3-235B, DeepSeek-R1). Points denote individual cases; boxplots show median and interquartile range; half-violins depict kernel density. b, Physician diagnostic time stratified by case difficulty, defined by bins of mean diagnostic accuracy across all participants (0–0.3, 0.3–0.6, $>$0.6). Boxplots summarize case-level times within each difficulty bin and experience group. c, Model output length (tokens) stratified by the same difficulty bins.
  • Figure 4: a, Sankey diagrams showing how five junior specialists’ diagnostic decisions transitioned before and after reviewing the agent’s output. Flows track the transitions between correct (yellow) and incorrect (red) diagnoses for each specialist. Node widths and link thicknesses are proportional to case counts, illustrating heterogeneous yet consistently positive net gains across physicians. b, Quantitative characterization of collaboration trade-offs. Top panel plots Correction rate (fraction of initially wrong diagnoses corrected after AI assistance) against Misleading rate (fraction of initially correct diagnoses overturned incorrectly), with point color indicating net benefit (correction minus misleading). Bottom panel plots Stability rate (retention of correct initial diagnoses) against Inefficacy rate (retention of incorrect initial diagnoses), with color indicating net stability (stability minus inefficacy). Dashed lines denote cohort means, and shaded quadrants highlight favorable (high correction or stability with low misleading or inefficacy) versus unfavorable regimes.
  • Figure 5: Impact of AI assistance stratified by disease incidence. Dumbbell plots track the absolute change in diagnostic accuracy for five junior specialists across three rarity tiers. Hollow circles represent unassisted baseline performance, and solid stars indicate AI-assisted performance. Lines connect within-physician changes (green for improvement, red for decline).
  • ...and 2 more figures
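
The four collaboration rates defined in the Figure 4b caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each case yields a correct/incorrect flag before and after the physician reviews the agent's output, that Correction and Inefficacy are taken over initially incorrect cases, and that Misleading and Stability are taken over initially correct cases. The names `CollaborationRates` and `collaboration_rates` and the toy flags are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CollaborationRates:
    correction: float  # initially wrong -> correct after assistance
    misleading: float  # initially correct -> wrong after assistance
    stability: float   # initially correct -> still correct
    inefficacy: float  # initially wrong -> still wrong

    @property
    def net_benefit(self) -> float:
        # Caption: "net benefit (correction minus misleading)".
        return self.correction - self.misleading

    @property
    def net_stability(self) -> float:
        # Caption: "net stability (stability minus inefficacy)".
        return self.stability - self.inefficacy


def collaboration_rates(
    before: Sequence[bool], after: Sequence[bool]
) -> CollaborationRates:
    """Compute per-physician transition rates from paired correctness flags."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same cases")
    n_correct = sum(before)
    n_wrong = len(before) - n_correct
    if n_correct == 0 or n_wrong == 0:
        # Assumption: rates are undefined if a denominator would be zero.
        raise ValueError("need at least one initially correct and one initially wrong case")

    corrected = sum((not b) and a for b, a in zip(before, after))
    misled = sum(b and (not a) for b, a in zip(before, after))
    return CollaborationRates(
        correction=corrected / n_wrong,
        misleading=misled / n_correct,
        stability=(n_correct - misled) / n_correct,
        inefficacy=(n_wrong - corrected) / n_wrong,
    )


# Hypothetical toy flags for one physician over eight cases (not paper data).
toy_before = [True, True, True, True, False, False, False, False]
toy_after = [True, True, True, False, True, True, False, False]

if __name__ == "__main__":
    rates = collaboration_rates(toy_before, toy_after)
    print(rates, rates.net_benefit, rates.net_stability)
```

Under these denominators, Correction and Inefficacy sum to 1, as do Misleading and Stability, so each panel of Figure 4b would plot two quantities that are not complements of one another.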