A Weakly Supervised Transformer for Rare Disease Diagnosis and Subphenotyping from EHRs with Pulmonary Case Studies

Kimberly F. Greco, Zongxin Yang, Mengyan Li, Han Tong, Sara Morini Sweet, Alon Geva, Kenneth D. Mandl, Benjamin A. Raby, Tianxi Cai

TL;DR

Rare diseases are often underdiagnosed because of their low prevalence, and phenotyping algorithms for them are hampered by noisy labels in real-world data. WEST addresses this by coupling a small gold-standard label set with a large silver-standard label set, using iterative label refinement within a transformer-based representation learning framework to achieve accurate phenotype classification and meaningful subphenotyping from heterogeneous EHR data. Across pulmonary hypertension and severe asthma case studies, WEST consistently outperforms rule-based methods, KOMAP, and standard supervised baselines, and it uncovers latent patient structure and prognostic signals through learned embeddings and clustering. By reducing manual annotation needs and leveraging routine EHR data, WEST offers a scalable path to faster diagnosis, improved cohort definition, and data-driven discovery in the rare-disease space.

Abstract

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain underdiagnosed and poorly characterized due to their low prevalence and limited clinician familiarity. Computational phenotyping offers a scalable approach to improving rare disease detection, but algorithm development is hindered by the scarcity of high-quality labeled data for training. Expert-labeled datasets from chart reviews and registries are clinically accurate but limited in scope and availability, whereas labels derived from electronic health records (EHRs) provide broader coverage but are often noisy or incomplete. To address these challenges, we propose WEST (WEakly Supervised Transformer for rare disease phenotyping and subphenotyping from EHRs), a framework that combines routinely collected EHR data with a limited set of expert-validated cases and controls to enable large-scale phenotyping. At its core, WEST employs a weakly supervised transformer model trained on extensive probabilistic silver-standard labels - derived from both structured and unstructured EHR features - that are iteratively refined during training to improve model calibration. We evaluate WEST on two rare pulmonary diseases using EHR data from Boston Children's Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically meaningful subphenotypes, and prediction of disease progression. By reducing reliance on manual annotation, WEST enables data-efficient rare disease phenotyping that improves cohort definition, supports earlier and more accurate diagnosis, and accelerates data-driven discovery for the rare disease community.
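
To make the idea of iteratively refined silver-standard labels concrete, the following is a minimal sketch and not the WEST implementation. It assumes a plain logistic regression in place of the transformer, synthetic features in place of EHR data, and a simple averaging rule for the refinement step. All names (`fit_soft_logistic`, `refine_silver_labels`, `mix`, `n_rounds`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_soft_logistic(X, y_soft, n_steps=300, lr=0.5):
    """Gradient descent on cross-entropy against soft (probabilistic) targets."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        grad = p - y_soft                      # gradient of cross-entropy w.r.t. logits
        w -= lr * (X.T @ grad) / len(y_soft)
        b -= lr * grad.mean()
    return w, b

def refine_silver_labels(X, silver, gold_idx, gold_y, n_rounds=3, mix=0.5):
    """Alternate between fitting on current labels and nudging silver labels
    toward the model's predictions. Gold labels are never overwritten.
    (Assumed update rule for illustration; the paper's rule may differ.)"""
    labels = silver.astype(float).copy()
    labels[gold_idx] = gold_y                  # pin expert-validated labels
    for _ in range(n_rounds):
        w, b = fit_soft_logistic(X, labels)
        pred = sigmoid(X @ w + b)
        labels = (1 - mix) * labels + mix * pred
        labels[gold_idx] = gold_y              # re-pin gold after each update
    return labels, (w, b)

# Synthetic stand-in for patient features and noisy probabilistic labels.
rng = np.random.default_rng(0)
n = 400
y_true = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5)) + y_true[:, None] * 1.5
silver = np.clip(0.5 + 0.3 * (2 * y_true - 1) + rng.normal(scale=0.3, size=n), 0, 1)
gold_idx = np.arange(20)                       # a small expert-labeled subset

refined, (w, b) = refine_silver_labels(X, silver, gold_idx, y_true[gold_idx])
```

The design point this sketch tries to capture is that the small gold-standard set stays fixed while the much larger silver-standard set is treated as a soft, revisable target.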

Paper Structure

This paper contains 30 sections, 24 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Overview of the WEST phenotyping pipeline.
  • Figure 2: WEST performance as a function of the number of gold-standard training labels. Curves report AUC and F1 score. The horizontal dotted line represents the best performing baseline, Transformer (gold only).
  • Figure 3: t-SNE visualization of patient-level embeddings for PH phenotypes using TF-IDF and WEST-derived representations.
  • Figure 4: Kaplan-Meier survival curves for PH subgroups identified via k-means clustering of transformer embeddings.
  • Figure 5: t-SNE visualization of patient-level embeddings for severe asthma (SA) phenotypes using TF-IDF and WEST-derived representations.
  • ...and 2 more figures