MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records

Yixuan Li, Archer Y. Yang, Ariane Marelli, Yue Li

TL;DR

MixEHR-SurG tackles interpretable mortality risk prediction from high-dimensional, multi-modal EHRs by integrating topic modeling with a Cox proportional hazards survival model. It extends MixEHR with a PheCode-guided topic prior and a survival supervision component, enabling joint inference of mortality-associated phenotype topics and time-to-event outcomes. Compared to the baselines, it achieves a superior dynamic AUROC for mortality prediction (mean AUROC of 0.89 on the simulated dataset and 0.645 on the Quebec CHD dataset) and identifies clinically coherent mortality-related topics on both the CHD and MIMIC-III data, bridging ICD codes, notes, and other EHR modalities. The approach yields interpretable, topic-level insights into mortality risk and demonstrates potential utility for phenotype discovery and personalized prognosis in healthcare.

Abstract

Survival models can help medical practitioners evaluate the prognostic importance of clinical variables to patient outcomes such as mortality or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise of large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high-dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with the Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data, consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes, and the MIMIC-III data, consisting of 1,458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC of 0.89 on the simulated dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.
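To make contribution (1) concrete, the following is a minimal sketch of the Cox proportional hazards partial log-likelihood evaluated on patient topic proportions. It covers only the survival term, not the joint topic-model inference described in the abstract. The function name, the argument layout, and the use of a simple risk set without special handling of tied event times are assumptions for illustration and are not taken from the paper.

```python
import math

def cox_partial_log_likelihood(theta, w, times, events):
    """Cox partial log-likelihood with topic proportions as covariates.

    theta  : list of per-patient topic proportions (each of length K)
    w      : list of K Cox coefficients, one per topic
    times  : observed time for each patient (event or censoring)
    events : 1 if death was observed, 0 if censored
    (All names are hypothetical; tied event times are not treated specially.)
    """
    # Linear risk score eta_j = w . theta_j for every patient.
    eta = [sum(wk * tk for wk, tk in zip(w, row)) for row in theta]
    log_lik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [eta[j] for j, t_j in enumerate(times) if t_j >= t_i]
        m = max(risk)  # log-sum-exp shift for numerical stability
        log_denominator = m + math.log(sum(math.exp(r - m) for r in risk))
        log_lik += eta[i] - log_denominator
    return log_lik

# Two patients, two topics: the patient loaded on the high-risk topic dies first.
theta = [[1.0, 0.0], [0.0, 1.0]]
times, events = [1.0, 2.0], [1, 1]
good_fit = cox_partial_log_likelihood(theta, [1.0, -1.0], times, events)
poor_fit = cox_partial_log_likelihood(theta, [-1.0, 1.0], times, events)
```

In this toy example, coefficients that assign a positive weight to the topic of the patient who dies first give a higher partial log-likelihood than coefficients with the signs reversed, which is the behavior the survival supervision relies on.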

Paper Structure

This paper contains 29 sections, 45 equations, 16 figures, 1 algorithm.

Figures (16)

  • Figure 1: MixEHR-SurG overview. MixEHR-SurG consists of four main steps. The training process is highlighted in green, and the prediction process is depicted in purple. In Step 1, we preprocess and aggregate raw EHR data for each patient $j$. Step 2 involves determining a $K$-dimensional phenotype topic prior, $\boldsymbol{\uppi}_j = (\pi_{j1}, \ldots, \pi_{jK})$, for each patient. Step 3 infers the phenotype topic distribution $\boldsymbol{\upphi}_k^{(m)} \in \mathbb{R}^{V^{(m)}}$ for EHR type $m$ in topic $k$ (i.e., the model parameters of MixEHR-SurG). This requires inferring the latent topic assignment $z_{ji}\in\{1,\ldots,K\}$ for each EHR token $i$ in patient $j$. In Step 4, the trained model is applied to predict a personalized survival function for a new patient. The details of the probabilistic graphical model are depicted in Fig. 2.
  • Figure 2: Probabilistic graphical model (PGM) illustration of four model variants. (a) PGM for MixEHR. We first generate topic distributions $\boldsymbol{\upphi}_k^{(m)}$ for each topic $k$ and document type $m$, then we generate a $K$-dimensional topic proportion $\boldsymbol{\uptheta}_j$ for every patient $j$. Finally, we generate latent topics $z_{ji}^{(m)}$ and corresponding words $x_{ji}^{(m)}$ for each EHR token. (b) PGM for MixEHR-G. We infer a patient-specific PheCode-guided topic prior $\boldsymbol{\uppi}_j$ for each patient $j$ and use it as the Dirichlet hyperparameters for the patient topic mixture $\boldsymbol{\uptheta}_j$, as enclosed by a blue dashed rectangle. (c) PGM for MixEHR-Surv. For each patient $j$, we obtain the survival time $T_j$ and employ the Cox proportional hazards (PH) model with coefficients $\mathbf{w}$ and baseline hazard function $h_0(\cdot)$ to guide the learning of topics, as enclosed by a green dashed rectangle. (d) PGM for the proposed MixEHR-SurG. We combine both the PheCode-guided prior and the survival information into a single model. The resulting model can use the guided phenotype topics to model the Cox PH survival likelihood.
  • Figure 3: Simulation Results for MixEHR-SurG. (a) Scatter plot comparing the estimated coefficients $\mathbf{w}$ (in green) with their true values (in blue). (b) ROC curve for predicting zero coefficients. (c) Dynamic AUC curves to evaluate survival time prediction.
  • Figure 4: Dynamic AUC curves for predicting time to death in CHD patients. We built a series of time points starting from 20 and incrementing by 20 up to 755. For each of these time points, we computed the cumulative AUC, which then formed the dynamic AUC curve. The mean AUC over time for each method is indicated by a dashed line and given in brackets after the method name in the legend. The compared methods are: Coxnet-MixEHR: a pipeline approach that trains MixEHR first and then trains a Cox elastic net (Coxnet) using the topic mixture from MixEHR as the input features; MixEHR-Surv: MixEHR with the Cox supervision but without the PheCode-guided prior for the topic inference; Coxnet-MixEHR-G: a pipeline approach that trains MixEHR-G first and then trains a Coxnet using the topic mixture from MixEHR-G as the input features; MixEHR-SurG: the method proposed in this paper; Coxnet-ICD9: Coxnet using ICD-9 codes as input features; Coxnet-PheCode: Coxnet using PheCodes as input features; Coxnet-AutoEncoder: Coxnet using the output of an autoencoder as input features; DeepSurv-PheCode: a deep survival model using PheCodes as input features.
  • Figure 5: Mortality-related phenotypes for CHD patients who experienced a first heart failure hospitalization. (a) Bar plot of the survival regression coefficients $\mathbf{w}$. The effect sizes of the 10 most positive and the 10 most negative phenotypes are displayed as a bar plot. Positive values refer to phenotypes that are associated with high risk of mortality, and negative values refer to phenotypes associated with low mortality risk. The inset in the upper-left corner contains the bar plot for all the estimated $w_k, k =1,\ldots,K$, ranked from the largest value to the smallest value. The top 3 and bottom 3 phenotypes are colored in blue and red, respectively. (b) The survival curves of patients with high and low risk of nonrheumatic pulmonary valve disorder (NPVD) (395.4). Patients were divided into two groups based on their topic proportions. The red curve represents patients with a higher topic proportion (top 30%) of NPVD and shows a significantly steeper decline and lower survival probability over time. The green curve, representing patients with lower topic proportions of the NPVD phenotype, shows a more gradual decline, reflecting a comparatively lower risk of mortality. (c) Effect sizes of the mortality-related phenotypes. We ran a simple Cox regression per phenotype topic to obtain the marginal effect sizes and 95% confidence intervals of the top 3 high-risk and bottom 3 low-risk mortality-associated phenotypes identified by MixEHR-SurG in panel (a). Points indicate the coefficient values, error bars show the 95% confidence intervals, and colors represent the significance levels of these coefficients. (d) Heatmap featuring the top ICD-9 codes from the three most positively predictive and three most negatively predictive phenotypes as determined by MixEHR-SurG. The intensity of the colors indicates the topic probability under each topic. The magnitudes of the Cox coefficients are displayed in the last row.
  • ...and 11 more figures
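
The generative process described in the Figure 2 caption, panels (a) and (b), can be sketched as follows. This is an illustrative sampler only, not the authors' inference code: it draws a patient topic mixture from a Dirichlet whose hyperparameters are the guided prior $\boldsymbol{\uppi}_j$, then draws a topic and a word for each token of each EHR type. All function names, the dictionary layout of `phi`, and the toy numbers are assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_patient(pi_j, phi, n_tokens, rng):
    """Sample one patient's record.

    pi_j     : K positive Dirichlet hyperparameters (the guided prior)
    phi      : phi[m][k] is the word distribution of topic k for EHR type m
    n_tokens : number of tokens to draw for each EHR type m
    """
    theta_j = sample_dirichlet(pi_j, rng)  # topic mixture shared across EHR types
    record = {}
    for m, n in n_tokens.items():
        tokens = []
        for _ in range(n):
            z = sample_categorical(theta_j, rng)    # latent topic of this token
            x = sample_categorical(phi[m][z], rng)  # observed code or word
            tokens.append((z, x))
        record[m] = tokens
    return theta_j, record

# Hypothetical toy setup: K = 3 topics, two EHR types with small vocabularies.
rng = random.Random(0)
pi_j = [5.0, 0.1, 0.1]  # prior concentrated on the first phenotype topic
phi = {
    "icd": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
    "notes": [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]],
}
theta_j, record = generate_patient(pi_j, phi, {"icd": 5, "notes": 3}, rng)
```

A prior such as `pi_j` above puts most of its mass on one topic, which is one way to picture how a guided prior ties a topic to a single phenotype; how the paper actually computes $\boldsymbol{\uppi}_j$ from PheCodes is not specified in these captions.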
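
The evaluation described in the Figure 4 caption (a grid of time points from 20 to 755 in steps of 20, with a cumulative AUC at each point) can be sketched as below. This is a simplified version under stated assumptions: it ignores inverse-probability-of-censoring weights, averages the curve with a plain mean, and uses hypothetical function names. The paper's exact estimator and averaging scheme are not given in the caption.

```python
def dynamic_auc(times, events, risk, t):
    """Cumulative/dynamic AUC at time t, without censoring weights.

    Cases are patients with an observed death at or before t; controls are
    patients still under observation after t. Returns None when either
    group is empty.
    """
    cases = [r for T, d, r in zip(times, events, risk) if d and T <= t]
    controls = [r for T, r in zip(times, risk) if T > t]
    if not cases or not controls:
        return None
    # Fraction of (case, control) pairs ranked correctly; ties count as half.
    concordant = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return concordant / (len(cases) * len(controls))

def mean_dynamic_auc(times, events, risk, grid):
    """Plain average of the defined AUC values over the time grid."""
    values = [dynamic_auc(times, events, risk, t) for t in grid]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None

# Time grid as in the caption: start at 20, step 20, not exceeding 755.
time_grid = list(range(20, 756, 20))

# Hypothetical toy data: higher risk scores belong to earlier deaths.
toy_times = [10, 30, 50, 70]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.8, 0.3, 0.1]
toy_mean = mean_dynamic_auc(toy_times, toy_events, toy_risk, [20, 40, 60])
```

With a step of 20 the last grid point is 740, because the next step would exceed 755. For real analyses, a censoring-weighted estimator is preferable; scikit-survival provides one as `cumulative_dynamic_auc`, although whether the authors used it is not stated here.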