End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Jian Zhang, Yanfeng Wang, Ya Zhang, Weidi Xie

TL;DR

This work tackles the trust gap in clinical AI by recasting diagnosis as an agentic system, Deep-DxSearch, that is trained end-to-end with reinforcement learning to search and reason over a large medical retrieval environment. Diagnosis is formalized as a sequential decision process over five action primitives (<reason>, <lookup>, <match>, <search>, <diagnose>), and training optimizes a composite reward for reasoning validity and evidence quality. The approach reports higher accuracy and better generalization than the compared baselines across common and rare diseases in multi-center benchmarks. Human-in-the-loop studies show physicians prefer its auditable evidence chains, suggesting improved clinical trust and utility despite longer processing times. The method addresses rare-disease challenges, reduces hallucinations, and provides a blueprint for trustworthy, transparent diagnostic assistants grounded in Evidence-Based Medicine; data, code, and checkpoints are publicly available.
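
The sequential decision process over the five primitives can be pictured with a small rollout loop. The Python sketch below is illustrative only and is not the authors' implementation: the closing tags (for example `</lookup>`), the `<result>` wrapper, the `run_episode` helper, the scripted policy, and the stub lookup tool are all assumptions made for this example.

```python
import re
from typing import Callable, Dict, List, Tuple

# The five action primitives named in the summary. Closing tags are an
# assumption of this sketch; the source only shows the opening tags.
TAG_RE = re.compile(
    r"<(reason|lookup|match|search|diagnose)>(.*?)</\1>", re.DOTALL
)


def parse_actions(text: str) -> List[Tuple[str, str]]:
    """Return (action, payload) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]


def run_episode(
    policy: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    case: str,
    max_steps: int = 8,
) -> List[Tuple[str, str]]:
    """Alternate policy output and tool calls until <diagnose> or the step cap."""
    context = case
    trace: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        for action, payload in parse_actions(policy(context)):
            trace.append((action, payload))
            if action == "diagnose":
                return trace  # terminal action ends the episode
            if action in tools:
                # Feed retrieved evidence back so the next step can use it.
                evidence = tools[action](payload)
                context += (
                    f"\n<{action}>{payload}</{action}>"
                    f"\n<result>{evidence}</result>"
                )
    return trace


# Hypothetical stand-ins for the trained policy and the retrieval corpus.
def scripted_policy(context: str) -> str:
    if "<result>" not in context:
        return "<reason>Fever and rash.</reason><lookup>measles</lookup>"
    return "<diagnose>measles</diagnose>"


tools = {"lookup": lambda query: f"guideline profile for {query}"}
trace = run_episode(scripted_policy, tools, "Patient: fever, rash")
print([action for action, _ in trace])
```

The returned trace is the kind of auditable evidence chain the summary refers to: every retrieval call and its payload is recorded in order, ending with the diagnosis.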

Abstract

The integration of Large Language Models (LLMs) into healthcare is constrained by knowledge limitations, hallucinations, and a disconnect from Evidence-Based Medicine (EBM). While Retrieval-Augmented Generation (RAG) offers a solution, current systems often rely on static workflows that miss the iterative, hypothetico-deductive reasoning of clinicians. To address this, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end via reinforcement learning (RL) for traceable diagnostic reasoning. Deep-DxSearch acts as an active investigator, treating the LLM as an agent within an environment of 16,000+ guideline-derived disease profiles, 150,000+ patient records for case-based reasoning, and over 27 million biomedical documents. Using soft verifiable rewards that co-optimize retrieval and reasoning, the model learns to formulate queries, evaluate evidence, and refine searches to close diagnostic gaps. Experiments show our end-to-end RL framework consistently outperforms prompt-engineering and training-free RAG methods. On in-distribution (ID) and out-of-distribution (OOD) benchmarks for common and rare diseases, Deep-DxSearch surpasses strong baselines (including GPT-4o, DeepSeek-R1, and medical-specific frameworks), achieving an average accuracy gain of 22.7% over the second-best model. In validation with 150 real-world cases, Deep-DxSearch boosts physicians' average diagnostic accuracy from 45.6% to 69.1%. These results indicate that evolving agentic systems to leverage statistical regularities in large-scale healthcare data is key for trustworthy diagnostic assistants. All data, code, and checkpoints are available at https://qiaoyu-zheng.github.io/Deep-DxSearch.
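
To make the idea of a soft reward that co-optimizes retrieval and reasoning more concrete, here is a minimal Python sketch. It is not the reward defined in the paper: the token-overlap F1 used as the "soft" diagnosis match, the recall-style retrieval term, the format gate, and the weights `w_dx`, `w_ret`, and `w_fmt` are all hypothetical choices made for illustration.

```python
from typing import Sequence


def token_f1(pred: str, gold: str) -> float:
    """Soft match between two diagnosis strings via token-overlap F1."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def composite_reward(
    pred_dx: str,
    gold_dx: str,
    retrieved: Sequence[str],
    relevant: Sequence[str],
    format_ok: bool,
    w_dx: float = 0.6,   # hypothetical weight on diagnosis correctness
    w_ret: float = 0.3,  # hypothetical weight on retrieval quality
    w_fmt: float = 0.1,  # hypothetical weight on well-formed output
) -> float:
    """Trajectory-level reward mixing diagnosis and retrieval signals."""
    if not format_ok:
        return 0.0  # malformed trajectories earn nothing in this sketch
    dx_score = token_f1(pred_dx, gold_dx)
    relevant_set = set(relevant)
    ret_score = (
        len(set(retrieved) & relevant_set) / len(relevant_set)
        if relevant_set
        else 0.0
    )
    return w_dx * dx_score + w_ret * ret_score + w_fmt


reward = composite_reward(
    "acute myocardial infarction",
    "myocardial infarction",
    retrieved=["doc1", "doc2"],
    relevant=["doc2", "doc3"],
    format_ok=True,
)
print(round(reward, 2))
```

Because the diagnosis term gives partial credit instead of requiring an exact string match, a near-miss prediction backed by useful evidence still receives a learning signal, which is one plausible reading of "soft verifiable" in the abstract.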

Paper Structure

This paper contains 27 sections, 9 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Contribution Overview. a. The proposed workflow. Top: The medical retrieval corpus serving as the search environment during both training and inference. Middle: Illustration of the Deep-DxSearch rollout process, where diagnostic trajectories are generated and optimized via reinforcement learning based on trajectory-level rewards. Bottom: An exemplar log demonstrating the system's traceable evidence retrieval. b. Key performance highlights across three dimensions: Deep-DxSearch achieves superior diagnostic accuracy compared to both general-purpose LLMs and specialized medical methods; demonstrates notable clinical utility in physician assistance, surpassing the performance of DeepSeek-R1; and consistently attains high reasoning quality (rated $>$"Good") across five dimensions in both "LLM-as-a-judge" and human evaluations.
  • Figure 1: Data statistics. a. Left: Overview of items and their relationships in the disease guideline. Middle: ICD coverage for common diseases and Orpha coverage for rare diseases. Right: Distribution of disease information sources, highlighting major public resources. b. Top: Summary statistics of patient records. Bottom: Distribution of outliers, illustrating discrepancies between real patient disease-symptom associations and guideline expectations; Breakdown of confirmed patient diagnoses by specialty. c. Summary statistics of the clinical knowledge collection. d. Detailed statistics of the seven-center datasets used for training and evaluation.
  • Figure 1: Data processing procedure. The datasets for training and evaluation are derived from eight data sources and are split into training, evaluation, and evaluation-only sets. The medical retrieval corpus is constructed partially from these datasets as well as additional authoritative online resources.
  • Figure 2: Comparison with previous diagnostic methods. a, Comparison of Deep-DxSearch with general-purpose LLMs (including GPT-4o, GPT-4o with retrieval, and DeepSeek-R1) on common and rare disease diagnosis (averaged across in-distribution datasets). b, Detailed performance breakdown of Deep-DxSearch versus medical-specific systems across individual in-distribution datasets. c, Comparative evaluation of Deep-DxSearch against all these diagnostic methods on out-of-distribution (OOD) datasets. Note that GPT-4o was excluded on Xinhua-Rare due to privacy constraints associated with this in-house dataset.
  • Figure 2: Analysis of reasoning dynamics before and after Deep-DxSearch training. a, b. Statistical comparison of action trajectories for common and rare disease diagnosis. The baseline model (top) exhibits significant algorithmic rigidity, with the fixed sequence "L,M,S,D" ($\texttt{<lookup>} \rightarrow \texttt{<match>} \rightarrow \texttt{<search>} \rightarrow \texttt{<diagnose>}$) accounting for 62.8% and 40.1% of cases, respectively. In contrast, Deep-DxSearch (bottom) demonstrates increased trajectory diversity (e.g., unique trajectory types increased from 22 to 37 for common diseases), indicating a shift from repetitive heuristics to customized diagnostic paths. c, d. Visualization of action flows illustrating the logical progression. The flows confirm a transition from linear, deterministic execution to adaptive, branching investigation suited to case complexity. e, f. Action-to-action transition probability matrices. Post-training matrices reveal the emergence of recursive behaviors (e.g., $\texttt{<search>} \rightarrow \texttt{<search>}$ for self-correction) and a smoothed probability distribution in rare diseases, suggesting that the agent's decisions are conditioned on evolving context rather than fixed rules.
  • ...and 4 more figures
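
The trajectory statistics described in the reasoning-dynamics caption (share of the fixed "L,M,S,D" sequence, number of unique trajectory types, and action-to-action transition probabilities) can be computed from logged action sequences. The Python sketch below shows one way to do this under stated assumptions: the three example trajectories are invented for illustration, the function names are hypothetical, and nothing here reproduces the paper's data or analysis code.

```python
from collections import Counter
from typing import List, Sequence

ACTIONS = ["reason", "lookup", "match", "search", "diagnose"]


def transition_matrix(trajectories: Sequence[Sequence[str]]) -> List[List[float]]:
    """Row-normalized P(next action | current action), ordered as ACTIONS."""
    counts: Counter = Counter()
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            counts[(current, nxt)] += 1
    matrix = []
    for a in ACTIONS:
        row_total = sum(counts[(a, b)] for b in ACTIONS)
        matrix.append(
            [counts[(a, b)] / row_total if row_total else 0.0 for b in ACTIONS]
        )
    return matrix


def sequence_share(
    trajectories: Sequence[Sequence[str]], pattern: Sequence[str]
) -> float:
    """Fraction of trajectories that exactly equal the given action sequence."""
    hits = sum(1 for traj in trajectories if list(traj) == list(pattern))
    return hits / len(trajectories) if trajectories else 0.0


def unique_trajectory_types(trajectories: Sequence[Sequence[str]]) -> int:
    """Number of distinct action sequences observed."""
    return len({tuple(traj) for traj in trajectories})


# Made-up logs: two rigid L,M,S,D runs and one with a repeated search.
logs = [
    ["lookup", "match", "search", "diagnose"],
    ["lookup", "match", "search", "diagnose"],
    ["lookup", "search", "search", "diagnose"],
]
lmsd = ["lookup", "match", "search", "diagnose"]
P = transition_matrix(logs)
search = ACTIONS.index("search")
print(sequence_share(logs, lmsd), unique_trajectory_types(logs), P[search][search])
```

A nonzero `P[search][search]` entry corresponds to the recursive search-to-search behavior the caption describes, and a lower `sequence_share` for the fixed pattern indicates less rigid trajectories.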