Emulating Clinician Cognition via Self-Evolving Deep Clinical Research

Ruiyang Ren; Yuhao Wang; Yunsen Liang; Lan Luo; Jing Liu; Haifeng Wang; Cong Feng; Yinan Zhang; Chunyan Miao; Ji-Rong Wen; Wayne Xin Zhao

TL;DR

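DxEvolve is a self-evolving diagnostic agent. It addresses two gaps in current clinical AI, single-pass retrospective prediction and the lack of auditable mechanisms for governed improvement, through an interactive deep clinical research workflow, and it supports an accountable pathway for the continual evolution of clinical AI.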

Abstract

Clinical diagnosis is a complex cognitive process, grounded in dynamic cue acquisition and continuous expertise accumulation. Yet most current artificial intelligence (AI) systems are misaligned with this reality, treating diagnosis as single-pass retrospective prediction while lacking auditable mechanisms for governed improvement. We developed DxEvolve, a self-evolving diagnostic agent that bridges these gaps through an interactive deep clinical research workflow. The framework autonomously requisitions examinations and continually externalizes clinical experience from increasing encounter exposure as diagnostic cognition primitives. On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4% on a reader-study subset, comparable to the clinician reference (88.8%). DxEvolve improved accuracy on an independent external cohort by 10.2% (categories covered by the source cohort) and 17.1% (uncovered categories) compared to the competitive method. By transforming experience into a governable learning asset, DxEvolve supports an accountable pathway for the continual evolution of clinical AI.

Paper Structure (16 sections, 6 figures)

Introduction
Results
Experimental design and the DxEvolve framework
DxEvolve achieves clinician-level diagnostic performance
External validation supports cross-institution portability of experiential gains
Self-evolution shows exposure-dependent scaling behavior and error-driven correction
Self-evolution is accompanied by progressive maturation of experience
DxEvolve's evidence acquisition aligns with clinical workflows and clinical guidelines
Discussion
Methods
DxEvolve framework
Data sources
Evaluation cohorts
Ethics approval and governance
Models and implementation
...and 1 more section

Figures (6)

Figure 1: DxEvolve: workflow-aligned diagnosis with experience-driven self-evolution. a, DxEvolve frames diagnosis as evidence-centered sequential reasoning, in contrast to the static, single-pass inference typical of retrospective evaluations that use complete records. b, Deep clinical research (DCR) workflow. Starting from the patient history, the agent iteratively plans the next step, requests evaluations (physical examination, laboratory tests and imaging) and, when necessary, consults external sources (guidelines and PubMed); only requested observations are revealed, and they are integrated into a compact, high-salience encounter state that guides subsequent actions until the final diagnosis. c, Diagnostic cognition primitives (DCPs). After each diagnostic reasoning episode, DxEvolve consolidates a DCP from the trajectory, consisting of a retrievable presentation pattern and evidence-linked guidance for investigation planning and diagnostic decision-making; DCPs are indexed in a repository and selectively reused in later encounters through a retrieval action that sits alongside medical evaluations and external-source searches in the same DCR workflow. d, Cohorts and protocol. DCPs are built from a MIMIC-CDM accrual pool that is strictly non-overlapping with evaluation encounters, then assessed on a held-out in-distribution MIMIC-CDM cohort and on an external hospital cohort for out-of-distribution evaluation.
Figure 2: Main diagnostic performance results on MIMIC-CDM. a, Diagnosis accuracy on the MIMIC-CDM evaluation cohort ($n$=400), reported per pathology and as the average. For each base LLM (color), we compare the CDM baseline, DxEvolve without DCP retrieval (DxEvolve w/o DCP), and DxEvolve over multiple seeds. b, Accuracy improvement of DxEvolve over the CDM baseline stratified by encounter-level diagnostic burden (easy versus hard). Points show the stratum-specific improvement for each base LLM; annotations indicate the improvement in each stratum and the between-stratum difference. c, Diagnosis accuracy on a reader-study subset of MIMIC-CDM ($n$=80). Bars report average diagnostic accuracy for CDM and DxEvolve, distinguished by light and dark shades of the same color, together with single-pass full-information (FI) inference (hatched). Specialist medical LLMs with limited action compliance are reported under FI only. The clinician reference (Doctors) corresponds to the published reader-study subset with full information available (hager_evaluation_2024).
Figure 3: External validation on an independent hospital cohort. a, Diagnostic accuracy on diagnoses overlapping with MIMIC-CDM (appendicitis, cholecystitis and pancreatitis) and their mean, evaluated using standardized English translations of the structured records. b, Category-level transfer on diagnoses that were never used for DCP accrual (liver abscess, urinary tract infection) and their mean, evaluated under the same protocol. c, Robustness to documentation in the institution's native language, evaluated on the same external encounters using the original Chinese records.
Figure 4: Exposure-dependent self-evolution and provenance of retrieved experience. a, Overall diagnosis accuracy on the fixed MIMIC-CDM evaluation cohort ($n$=400) as the DCP accrual pool increases, shown for three representative base LLM backbones. Accuracy improves with additional accrual encounters and then tapers, yielding a saturating learning curve. b, Provenance of retrieved experience during evaluation. Bars show the fraction of retrieved DCPs whose source accrual episode ended in an incorrect diagnosis ("incorrect experience rate"), computed separately for improvement cases and for all evaluation encounters pooled. $P$ values indicate enrichment of incorrect-source DCPs among retrievals in improvement cases.
Figure 5: Maturation of accrued experience artifacts with encounter exposure. a, Blinded clinician ratings of diagnostic cognition primitives (DCPs) sampled from an early exposure window (encounters 1–300; $n$=20) and a late window (encounters 1700–2000; $n$=20). DCPs were scored for clinical correctness, actionability and generalizability, with the mean shown as an aggregate. Boxes denote interquartile range, centre line the median, and points individual DCPs; two-sided $P$ values are shown (n.s., not significant). b, Inter-rater reliability of clinician ratings for the aggregate DCP score (ICC=0.81), supporting the reliability of the clinician assessment. c, Evaluation-time retrieval signal for late-stage DCPs, quantified as the fraction of retrieval events that involve DCPs in the late encounter window.
...and 1 more figure
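
The Figure 1 caption describes a loop in which the agent requests evaluations, sees only what it requested, and afterwards consolidates a DCP into a repository for later retrieval. The following is a minimal Python sketch of that control flow, not the authors' implementation. Every name in it (`DCP`, `Encounter`, `retrieve`, `run_encounter`, `toy_policy`, `consolidate`) is hypothetical, the keyword-overlap retrieval is a toy stand-in for whatever indexing the paper uses, and the rule-based policy stands in for the LLM planner.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class DCP:
    """Hypothetical diagnostic cognition primitive: pattern plus guidance."""
    pattern: set[str]      # retrievable presentation pattern (toy: keywords)
    guidance: str          # evidence-linked guidance distilled from a trajectory
    source_correct: bool   # provenance: was the source episode diagnosed correctly?


@dataclass
class Encounter:
    history: str
    hidden: dict[str, str]                      # observations revealed only on request
    state: list[str] = field(default_factory=list)

    def request(self, evaluation: str) -> str:
        # Only requested observations are revealed and added to the encounter state.
        result = self.hidden.get(evaluation, "not available")
        self.state.append(f"{evaluation}: {result}")
        return result


def retrieve(repo: list[DCP], history: str, k: int = 1) -> list[DCP]:
    """Toy retrieval (assumption): rank DCPs by keyword overlap with the history."""
    words = set(history.lower().split())
    ranked = sorted(repo, key=lambda d: len(d.pattern & words), reverse=True)
    return [d for d in ranked[:k] if d.pattern & words]


def toy_policy(enc: Encounter, hints: list[DCP]) -> tuple[str, str]:
    """Rule-based stand-in for the LLM planner; it ignores hints for simplicity."""
    requested = {s.split(":")[0] for s in enc.state}
    for step in ("physical examination", "laboratory tests", "imaging"):
        if step not in requested:
            return ("request", step)
    findings = " ".join(enc.state).lower()
    return ("diagnose", "appendicitis" if "appendix" in findings else "undetermined")


def run_encounter(enc: Encounter, repo: list[DCP], policy, max_steps: int = 5) -> str:
    """Plan, request, integrate; stop when the policy commits to a diagnosis."""
    hints = retrieve(repo, enc.history)
    for _ in range(max_steps):
        action, arg = policy(enc, hints)
        if action == "diagnose":
            return arg
        enc.request(arg)
    return "undetermined"


def consolidate(enc: Encounter, diagnosis: str, truth: str, repo: list[DCP]) -> None:
    """Store a DCP built from the finished trajectory, keeping its provenance."""
    guidance = f"Evidence gathered: {'; '.join(enc.state)} -> {diagnosis}"
    repo.append(DCP(set(enc.history.lower().split()), guidance, diagnosis == truth))


repo: list[DCP] = []
enc = Encounter("right lower quadrant pain",
                {"imaging": "dilated appendix with wall thickening"})
dx = run_encounter(enc, repo, toy_policy)
consolidate(enc, dx, "appendicitis", repo)
```

Keeping `source_correct` on each stored item is one way to make provenance analyses such as the "incorrect experience rate" in Figure 4b computable from the repository alone; whether DxEvolve stores provenance this way is an assumption.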
