MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

Ruohan Wang, Zilong Wang, Ziyang Song, David Buckeridge, Yue Li

TL;DR

MixEHR-Nest is a guided topic model that infers subphenotype topics for thousands of diseases from multi-modal EHR data. The inferred subphenotypes are predictive of disease progression and severity, and the model is evaluated on two EHR datasets.

Abstract

Automatic subphenotyping from electronic health records (EHRs) provides numerous opportunities to understand diseases with unique subgroups and to enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer subphenotype topics for thousands of diseases from multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics within each phenotype topic, whose prior is guided by expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset, consisting of over 38 thousand patients from the intensive care unit (ICU) of Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; and (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive of disease progression and severity. For example, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not themselves differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality in ICU patients and of initial insulin treatment in diabetic patients but also revealed the contributions of individual subphenotypes. In longitudinal analysis, MixEHR-Nest identified subphenotypes with distinct age prevalence under the same phenotype, for phenotypes such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.
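
To make the idea of a phenotype-guided prior over subtopics concrete, the following is a minimal sketch and not the released MixEHR-Nest code. It assumes that each of K phenotypes (for example PheCodes or CCS codes) is split into M subtopics, and that a patient's Dirichlet prior is raised for every subtopic of a phenotype observed in that patient's record. The function name `guided_subtopic_prior` and the values `high` and `low` are hypothetical choices made for illustration only.

```python
import numpy as np

def guided_subtopic_prior(patient_phenotypes, n_phenotypes, n_subtopics,
                          high=0.9, low=1e-3):
    """Build one patient's prior over K phenotypes x M subtopics.

    patient_phenotypes: phenotype indices (0..K-1) seen in the record,
        e.g. obtained by mapping ICD codes to PheCodes or CCS codes.
    high / low: illustrative prior weights (assumed, not from the paper).
    Returns an array of shape (K, M).
    """
    # Start every subtopic at a small weight so unobserved phenotypes
    # receive little prior mass.
    alpha = np.full((n_phenotypes, n_subtopics), low, dtype=float)
    # All M subtopics of an observed phenotype share the larger weight;
    # the data then decide how tokens are split among them.
    observed = np.array(sorted(set(patient_phenotypes)), dtype=int)
    alpha[observed, :] = high
    return alpha

# Usage: a patient with phenotypes 1 and 3 out of K=5, with M=3 subtopics.
alpha_d = guided_subtopic_prior([1, 1, 3], n_phenotypes=5, n_subtopics=3)
```

Because the prior is identical across the subtopics of one phenotype, any separation between them (for instance type 1 versus type 2 diabetes under a single CCS code) has to come from the observed EHR tokens rather than from the guidance.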

Paper Structure

This paper contains 39 sections, 12 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Schematic of MixEHR-Nest on the MIMIC EHR data. (a) Subphenotype prior guidance. For each patient $d$, MixEHR-Nest initializes the phenotype topic prior $\alpha_{d,k_m}$ for subtopic $m$ of phenotype topic $k$ by computing PheCode occurrence. (b) Multi-modal EHR modeling. MixEHR-Nest learns multi-modal phenotype topics $\bm{\upphi}^{(t)}$ for each modality $t$. (c) Graphical model of MixEHR-Nest. The topic mixture $\bm{\uptheta}_{d}$ is drawn from a Dirichlet distribution with parameter $\bm{\upalpha}_d$. For an EHR token $i$ from modality $t$, the topic assignment $z_{id}$ is sampled from a categorical distribution with parameter $\bm{\uptheta}_d$. Given the topic assignment $z_{id} = k_m$, the EHR token $x_{id}$ is then sampled from a categorical distribution with parameter $\bm{\upphi}^{(t)}_{z_{id}}$.
  • Figure 2: Top ICD codes inferred by MixEHR-Nest for the CCS-guided phenotype topics from the MIMIC-III data for low birth and diabetic phenotypes. As a proof of concept, we used PheCodes to label ICD-9 codes to show that the subtopics found under the CCS codes reflect the PheCode system, which was not used to train the model. (a) 3 subtopics per phenotype (M=3). (b) 4 subtopics per phenotype (M=4).
  • Figure 3: Top K precision of ICU mortality prediction.
  • Figure 4: Analysis of high-risk mortality disease predicted by MixEHR-Nest. SHAP summary plot illustrating the impact of the top 20 high-risk disease subphenotypes on model output for 100 high-risk patients who died in the ICU. Each dot represents the SHAP value for a subphenotype $k_m$ in a sample $d$, with color indicating MixEHR-Nest estimated $\hat{\uptheta}_{dk_m}$.
  • Figure 5: High-risk mortality phenotypes. The first column shows the predominant ICD codes for each topic, with sidebars highlighting the PheCode-defining ICD-9 codes. The second column shows the primary medications correlated with each topic. The third column shows the chief CPT codes associated with each topic. The last column shows key features from doctors' notes related to each topic. (a) Cirrhosis of liver without alcohol subphenotypes (571.51). (b) Other conditions of brain (348). (c) Bone marrow or stem cell transplant (860).
  • ...and 6 more figures
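
The generative process described in the Figure 1(c) caption can be sketched in a few lines of NumPy. This is an illustrative forward sampler under stated assumptions, not the authors' inference code: the function name `sample_patient_tokens`, the flattened subtopic index (one index per pair $k_m$), and the dictionary layout keyed by modality name are all hypothetical.

```python
import numpy as np

def sample_patient_tokens(alpha_d, phi_by_modality, tokens_per_modality, rng):
    """Sample one patient's EHR tokens from the caption's generative story.

    alpha_d: (K*M,) positive Dirichlet parameters for patient d.
    phi_by_modality: dict modality -> (K*M, V_t) array; each row is a
        categorical distribution over that modality's vocabulary.
    tokens_per_modality: dict modality -> number of tokens to draw.
    """
    # theta_d ~ Dirichlet(alpha_d): the patient's mixture over subtopics.
    theta_d = rng.dirichlet(alpha_d)
    record = {}
    for t, phi_t in phi_by_modality.items():
        n = tokens_per_modality[t]
        # z_id ~ Categorical(theta_d): a subtopic assignment per token.
        z = rng.choice(len(theta_d), size=n, p=theta_d)
        # x_id ~ Categorical(phi^(t)_{z_id}): the token drawn from the
        # modality-specific distribution of the assigned subtopic.
        x = np.array([rng.choice(phi_t.shape[1], p=phi_t[k]) for k in z],
                     dtype=int)
        record[t] = (z, x)
    return theta_d, record

# Usage with toy sizes: 4 subtopics, two modalities with made-up vocabularies.
rng = np.random.default_rng(0)
alpha_d = np.array([1.0, 1.0, 0.05, 0.05])
phi = {"icd": rng.dirichlet(np.ones(6), size=4),
       "drug": rng.dirichlet(np.ones(3), size=4)}
theta_d, record = sample_patient_tokens(alpha_d, phi, {"icd": 5, "drug": 2}, rng)
```

All modalities share the same $\bm{\uptheta}_d$, which is what ties a patient's diagnoses, medications, and other record types to a common set of subtopics in this sketch.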