An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie
TL;DR
The paper presents DeepRare, an agentic large language model (LLM) framework designed to shorten the diagnostic odyssey in rare diseases. It uses a three-tier architecture inspired by the Model Context Protocol, comprising a central host, specialized agent servers, and web-scale medical knowledge sources. The system processes heterogeneous inputs, including free text, Human Phenotype Ontology (HPO) terms, and genomic data, and generates a ranked differential diagnosis with traceable, evidence-based rationales. Across nine diagnostic datasets spanning 14 medical specialties, DeepRare achieves higher Recall@1 and Recall@3 than competing methods, outperforms expert clinicians in HPO-guided diagnosis, and remains robust on long-tail diseases, especially when genotype data are incorporated. The system supports interpretability through verifiable references and uses a self-reflective loop to mitigate hallucinations. It is deployed as a web application intended to integrate with clinical workflows. These results suggest that agentic LLMs can reshape rare disease workflows by delivering accurate, transparent, and scalable decision support grounded in medical evidence.
Abstract
Rare diseases affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains an urgent challenge. Patients often endure a prolonged diagnostic odyssey exceeding five years, marked by repeated referrals, misdiagnoses, and unnecessary interventions, leading to delayed treatment and substantial emotional and economic burdens. Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources. DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked diagnostic hypotheses with transparent reasoning linked to verifiable medical evidence. Evaluated across nine datasets from literature, case reports and clinical centres across Asia, North America and Europe spanning 14 medical specialties, DeepRare demonstrates exceptional performance on 3,134 diseases. In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next-best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases. Expert review achieved 95.4% agreement on its reasoning chains, confirming their validity and traceability. Our work not only advances rare disease diagnosis but also demonstrates how the latest powerful large-language-model-driven agentic systems can reshape current clinical workflows.
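The headline metric above, Recall@k, is the fraction of cases in which the true diagnosis appears among the top k ranked hypotheses. The following is a minimal sketch of that computation, not the authors' evaluation code. The function name recall_at_k, the exact-string matching of disease names, and the four toy cases are all assumptions made for illustration; a real evaluation would likely match on disease identifiers rather than on names.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of cases whose true diagnosis is among the top-k ranked candidates.

    Hypothetical helper: compares disease names by exact string equality.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same number of cases")
    if not ranked:
        raise ValueError("at least one case is required")
    # A case counts as a hit when the true label is in the first k candidates.
    hits = sum(1 for candidates, label in zip(ranked, truth) if label in candidates[:k])
    return hits / len(ranked)


# Toy data invented for illustration only; not taken from the paper.
toy_ranked = [
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
    ["Fabry disease", "Gaucher disease", "Pompe disease"],
    ["Rett syndrome", "Pitt-Hopkins syndrome", "Angelman syndrome"],
    ["Wilson disease", "Menkes disease", "Hemochromatosis"],
]
toy_truth = [
    "Marfan syndrome",    # ranked first: hit at k=1 and k=3
    "Gaucher disease",    # ranked second: hit at k=3 only
    "Angelman syndrome",  # ranked third: hit at k=3 only
    "Alagille syndrome",  # not in the list: miss at every k
]

print(f"Recall@1 = {recall_at_k(toy_ranked, toy_truth, 1):.2f}")
print(f"Recall@3 = {recall_at_k(toy_ranked, toy_truth, 3):.2f}")
```

On this toy set, one of four cases is correct at rank 1 and three of four within the top 3, so Recall@1 is 0.25 and Recall@3 is 0.75. Recall@k never decreases as k grows, which is why reporting it at more than one cutoff is informative.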
