An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram, Rizwan Hamid, Daniel V. Fabbri, Adam Wright, Josh F. Peterson, Lisa Bastarache, Hua Xu

TL;DR

RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.

Abstract

Phenotyping is fundamental to rare disease diagnosis, but manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale. Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. We developed RARE-PHENIX, an end-to-end AI framework for rare disease phenotyping that integrates large language model-based phenotype extraction, ontology-grounded standardization to HPO terms, and supervised ranking of diagnostically informative phenotypes. We trained RARE-PHENIX using data from 2,671 patients across 11 Undiagnosed Diseases Network clinical sites, and externally validated it on 16,357 real-world clinical notes from Vanderbilt University Medical Center. Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation (e.g., ontology-based similarity of 0.70 vs. 0.58). Ablation analyses demonstrated performance improvements with the addition of each module in RARE-PHENIX (extraction, standardization, and prioritization), supporting the value of modeling the full clinical phenotyping workflow. By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.

Paper Structure (17 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of RARE-PHENIX. RARE-PHENIX is an end-to-end AI system for automating the extraction, standardization, and prioritization of rare disease phenotypes from unstructured clinical text. The system consists of three modules for 1) extracting rare disease features from clinical notes with large language models (LLMs); 2) standardizing these features to structured Human Phenotype Ontology (HPO) terms using retrieval-augmented generation; and 3) prioritizing diagnostically informative HPO terms using a supervised ranking model. LLMs include LLaMA-2-chat (7b, 13b, and 70b), LLaMA-3-instruct (8b and 70b), LLaMA-3.1-instruct (8b and 70b), LLaMA-3.2-instruct (1b and 3b), LLaMA-3.3-instruct (70b), and a secure instance of Azure OpenAI's ChatGPT-4o (v2024-06-01) provisioned for handling protected health information in accordance with institutional data governance policies.
  • Figure 2: End-to-end performance results of RARE-PHENIX and PhenoBERT on the external validation cohort. For legibility, only the top-performing large language models (i.e., ChatGPT-4o, LLaMA-2-70b, LLaMA-3-70b, LLaMA-3.1-70b) are shown in the figure, alongside the baseline comparator (PhenoBERT), across top-$k$ cutoffs. The end-to-end performance results of other RARE-PHENIX configurations are provided in Supplementary Table S2.
  • Figure 3: Module-based Ablation Analysis Results of RARE-PHENIX Across Extraction and Standardization Modules
  • Figure 4: Contribution of phenotype prioritization to diagnostic utility. Improvement in performance using the prioritization module (Module 3) relative to a random ordering of the same extracted phenotypes. For each patient, extracted HPO terms were randomly permuted 200 times, and performance was evaluated at top-$k$ cutoffs ($k = 10, 20, 30, 40, 50$). Values represent the mean difference ($\Delta$ = ranking by Module 3 $-$ ranking by random ordering), and shaded regions indicate 95% bootstrap intervals obtained by resampling at the patient level.
  • Figure 5: Results of systematic error analysis. False negatives and false positives of RARE-PHENIX with the best-performing large language model configurations (ChatGPT-4o, LLaMA-2-70b, LLaMA-3-70b, LLaMA-3.1-70b) and the baseline comparator (PhenoBERT) at different top-$k$ cutoffs ($k = 10, 20, 30, 40, 50$).
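
The comparison described in the Figure 4 caption (a ranked list versus random orderings of the same terms, summarized as a mean difference with a patient-level bootstrap interval) could be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, precision at top-$k$ stands in for whichever metric the paper reports, the percentile method for the interval is an assumption, and the HPO-style identifiers are toy data.

```python
import random
from statistics import mean


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked terms found in the gold set (stand-in metric)."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term in gold) / len(top)


def delta_vs_random(ranked, gold, k, n_perm=200, seed=0):
    """Metric under the given ranking minus its mean over random orderings."""
    rng = random.Random(seed)
    shuffled_scores = []
    for _ in range(n_perm):
        perm = ranked[:]          # same extracted terms, random order
        rng.shuffle(perm)
        shuffled_scores.append(precision_at_k(perm, gold, k))
    return precision_at_k(ranked, gold, k) - mean(shuffled_scores)


def bootstrap_interval(per_patient_deltas, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean delta, resampling patients."""
    rng = random.Random(seed)
    n = len(per_patient_deltas)
    means = sorted(
        mean(rng.choices(per_patient_deltas, k=n)) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy patient: the two gold terms happen to be ranked first.
ranked_terms = ["HP:0000001", "HP:0000002", "HP:0000003", "HP:0000004"]
gold_terms = {"HP:0000001", "HP:0000002"}

delta_top2 = delta_vs_random(ranked_terms, gold_terms, k=2)
delta_all = delta_vs_random(ranked_terms, gold_terms, k=4)
interval = bootstrap_interval([0.1, 0.2, 0.3, 0.4])
```

In this toy case `delta_top2` is positive because the ranking places both gold terms first, while `delta_all` is zero because a cutoff that covers every extracted term makes ordering irrelevant. That is why a comparison of this kind is only informative at cutoffs smaller than the number of extracted terms.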