An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie

TL;DR

The paper presents DeepRare, an agentic large language model (LLM) framework designed to address the diagnostic odyssey in rare diseases. It employs a three-tier, Model Context Protocol (MCP)-inspired architecture with a central host, specialized agent servers, and web-scale medical knowledge sources to process heterogeneous inputs, including free text, Human Phenotype Ontology (HPO) terms, and genomic data, generating a ranked differential diagnosis with traceable, evidence-based rationales. Across eight diverse diagnostic datasets and 14 medical specialties, DeepRare achieves superior Recall@1 and Recall@3 performance, outperforms expert clinicians in HPO-guided diagnosis, and demonstrates robustness on long-tail diseases, especially when incorporating genotype data. The system emphasizes interpretability through verifiable references and a self-reflective loop to mitigate hallucinations, and it is deployed via a user-friendly web application to integrate with clinical workflows. These results suggest that agentic LLMs can reshape rare disease workflows by delivering accurate, transparent, and scalable decision support grounded in medical evidence.

Abstract

Rare diseases affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains an urgent challenge. Patients often endure a prolonged diagnostic odyssey exceeding five years, marked by repeated referrals, misdiagnoses, and unnecessary interventions, leading to delayed treatment and substantial emotional and economic burdens. Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources. DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked diagnostic hypotheses with transparent reasoning linked to verifiable medical evidence. Evaluated across nine datasets from literature, case reports and clinical centres across Asia, North America and Europe spanning 14 medical specialties, DeepRare demonstrates exceptional performance on 3,134 diseases. In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next-best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases. Expert review achieved 95.4% agreement on its reasoning chains, confirming their validity and traceability. Our work not only advances rare disease diagnosis but also demonstrates how the latest powerful large-language-model-driven agentic systems can reshape current clinical workflows.
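The Recall@1 and Recall@3 figures quoted above count a case as a hit when the correct diagnosis appears among the top k ranked hypotheses. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `recall_at_k`, the toy cases, and the use of exact string matching to decide a hit are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose gold diagnosis is among the top-k predictions.

    Assumption: a hit is an exact string match; a real evaluation would
    likely normalize disease names or map them to ontology identifiers.
    """
    if len(ranked_predictions) != len(gold):
        raise ValueError("one ranked list is required per gold label")
    if not gold:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, gold)
        if truth in preds[:k]
    )
    return hits / len(gold)


# Toy, made-up ranked outputs for four hypothetical cases.
cases = [
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
    ["Fabry disease", "Gaucher disease", "Pompe disease"],
    ["Wilson disease", "Menkes disease", "Hemochromatosis"],
    ["Rett syndrome", "Angelman syndrome", "Pitt-Hopkins syndrome"],
]
gold = [
    "Marfan syndrome",            # ranked 1st: hit at k=1 and k=3
    "Gaucher disease",            # ranked 2nd: hit at k=3 only
    "Hemochromatosis",            # ranked 3rd: hit at k=3 only
    "CDKL5 deficiency disorder",  # absent: miss at every k
]

r1 = recall_at_k(cases, gold, k=1)
r3 = recall_at_k(cases, gold, k=3)
print(f"Recall@1 = {r1:.2f}, Recall@3 = {r3:.2f}")
```

Because a hit at rank 1 is also a hit at rank 3, Recall@3 can never be lower than Recall@1 on the same cases.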

Paper Structure

This paper contains 44 sections, 17 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: DeepRare: An agentic framework for rare disease prioritization. a System workflow: Multi-modal patient data (HPO terms, genomic variants) are processed through a tiered MCP-inspired architecture, generating a ranked Top-K diagnosis list with evidence-supported reasoning chains. b Knowledge architecture: Sunburst visualization depicting hierarchical integration of diagnostic tools and biomedical knowledge sources within DeepRare. c Performance benchmarking: Comparative evaluation across diagnostic APIs, general-purpose LLMs, reasoning-enhanced LLMs, medically-tuned LLMs, and agentic systems. Created in BioRender. https://BioRender.com/ija3tl0
  • Figure 1: Overview of the DeepRare system. a The input consists of patient free-text information, structured HPO IDs, or a combination of the two. b The three-level components in RaraDx. Inspired by the MCP, our system can be likened to a personal computer system architecture, comprising: (1) a central host with a memory bank that manages and coordinates the system, analogous to the main computer processing system; (2) multiple agent servers that organize tools, execute specific tasks, and interact with the external environment, analogous to auxiliary hardware; (3) comprehensive external data sources, representing a complete external rare-disease diagnostic environment, which support the entire system with reliable medical evidence, including medical knowledge and clinical cases. c Flowchart of the system's main workflow, which has two primary stages: the information collection stage and the self-reflection diagnosis stage. In the former, the central host actively collects medical support information relevant to the patient. In the latter, the central host performs self-reflection on its diagnostic results. Steps involving the central host are highlighted in blue boxes within the flowchart. Created in BioRender. https://BioRender.com/w2vqp03
  • Figure 2: HPO-wise cross-dataset evaluation and comparative performance of DeepRare. a Diagnostic accuracy on seven public rare disease registries, demonstrating DeepRare's significant advantage over leading baselines, particularly on RareBench-MME (70.0% top-1 accuracy) and RareBench-RAMEDIS (72.6% top-1 accuracy). b Superior performance consistency on the Xinhua Hospital cohort (local model evaluation only, due to privacy considerations).
  • Figure 2: Cohort curation pipeline and allocation strategy for MIMIC-IV-Note and Xinhua Hospital datasets. Left: MIMIC-IV-Note dataset including 331,794 cases, with 9,185 remaining after exclusions and divided into test (n = 1,875) and library (n = 7,310) sets. Right: Xinhua Hospital dataset including 352,425 cases, with 5,820 remaining after exclusions and divided into test (n = 975) and library (n = 4,845) sets. Both datasets underwent rare disease checks and information completeness filtering.
  • Figure 3: DeepRare's diagnostic performance. a Comparison of diagnostic accuracy across fourteen body systems, showing DeepRare's superior performance in most specialties compared to LLM (DeepSeek-V3), Reasoning LLM (DeepSeek-R1), and Medical LLM (MedIns). b Disease-level recall performance comparison for diseases with $>10$ cases, showing DeepRare's consistent superiority. c Real-world clinical validation study: Diagnostic recall performance comparison of specialized rare disease physicians (10+ years of experience, with search engine access), LLM (DeepSeek-V3), and DeepRare using unprocessed outpatient clinical narratives (free-text context only). d Diagnostic performance with HPO and gene data input, compared with the baseline method and with HPO-only input.
  • ...and 3 more figures
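
The overview caption above describes a central host with a memory bank, agent servers that wrap tools, and a workflow with an information collection stage followed by a self-reflection diagnosis stage. The following is a toy sketch of that shape only, not the DeepRare implementation: every class, function, and field name is hypothetical, the LLM's hypothesis generation is replaced by a fixed stub (`propose`), the knowledge table is made up, and self-reflection is reduced to dropping any candidate that has no stored evidence. Looping back to collect more evidence is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    source: str   # name of the agent server that produced the evidence
    disease: str  # disease the evidence supports
    note: str     # short snippet a reader could verify


# Hypothetical interface: an agent server maps phenotypes to evidence items.
AgentServer = Callable[[List[str]], List[Evidence]]


@dataclass
class CentralHost:
    servers: Dict[str, AgentServer]
    propose: Callable[[List[str]], List[str]]  # stand-in for LLM hypotheses
    memory: List[Evidence] = field(default_factory=list)  # the memory bank

    def collect(self, phenotypes: List[str]) -> None:
        """Stage 1: query every agent server and store what comes back."""
        for server in self.servers.values():
            self.memory.extend(server(phenotypes))

    def reflect(self, candidates: List[str]) -> List[str]:
        """Stage 2: keep only candidates backed by evidence in memory."""
        supported = {ev.disease for ev in self.memory}
        return [c for c in candidates if c in supported]

    def run(self, phenotypes: List[str]) -> List[dict]:
        self.collect(phenotypes)
        kept = self.reflect(self.propose(phenotypes))
        # Attach the supporting evidence so each diagnosis is traceable.
        return [
            {
                "disease": d,
                "evidence": [
                    f"{ev.source}: {ev.note}"
                    for ev in self.memory
                    if ev.disease == d
                ],
            }
            for d in kept
        ]


def toy_knowledge_server(phenotypes: List[str]) -> List[Evidence]:
    """Made-up lookup table standing in for an external knowledge source."""
    table = {
        "arachnodactyly": ("Marfan syndrome", "toy entry: long, slender fingers"),
        "ectopia lentis": ("Marfan syndrome", "toy entry: lens dislocation"),
    }
    return [Evidence("knowledge", *table[p]) for p in phenotypes if p in table]


host = CentralHost(
    servers={"knowledge": toy_knowledge_server},
    # Fixed stub: the second hypothesis has no evidence and is filtered out.
    propose=lambda phenotypes: ["Marfan syndrome", "Homocystinuria"],
)
result = host.run(["arachnodactyly", "ectopia lentis"])
print(result)
```

Keeping the evidence items in the host's memory, rather than inside each server, is what lets the final output cite a source for every retained diagnosis in this sketch.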