Subpopulation-Specific Synthetic EHR for Better Mortality Prediction

Oriel Perets; Nadav Rappoport


TL;DR

This work proposes a novel ensemble framework that uses generative models to address the underrepresentation of certain subpopulations (SPs) in EHRs, and shows improved model performance on underrepresented SPs.

Abstract

Electronic health records (EHRs) often represent certain subpopulations (SPs) unevenly. Factors such as patient demographics, clinical condition prevalence, and medical center type contribute to the underrepresentation of some SPs. Consequently, machine learning models trained on such datasets generalize poorly and underperform on underrepresented SPs. To address this issue, we propose a novel ensemble framework that uses generative models. Specifically, we train a GAN-based synthetic data generator for each SP and add its synthetic samples to that SP's training set. We then train SP-specific prediction models. To evaluate the method, we design an evaluation pipeline with two real-world use-case datasets queried from the MIMIC database. Our approach improves model performance on underrepresented SPs. Our code and models are provided as supplementary material and will be made available in a public repository.


Paper Structure (15 sections, 2 figures, 2 tables)


Background
Related Work
Preprocessing Approaches
Algorithmic Approaches
Synthetic Data for Performance Boosting
Relevant Techniques
Method
Proposed method
Datasets and Prediction Tasks
Subpopulation Definitions
Ensemble-GAN Training
Prediction Models
Evaluation Pipeline
Experiments and Results
Discussion

Figures (2)

Figure 1: Method flow diagram. We query a cohort for the specific ML task, split the dataset into SPs using the chosen PM, split each SP into training and validation sets, train the synthetic data generator, augment the training set with synthetic samples, train the prediction model, and evaluate performance on a validation set composed of real samples only. We then repeat the process for varying numbers of synthetic samples.
Figure 2: SP sizes per use case ordered by size.
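
The flow described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: `GaussianStandIn` is a hypothetical placeholder for the per-SP GAN-based generator, `NearestCentroid` is a hypothetical placeholder for the prediction model, accuracy stands in for whatever metric the paper reports, and the record layout (`"sp"`, `"x"`, `"y"`) and toy data are invented for the example.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def make_toy_records(seed=0):
    """Invented toy cohort: a large SP "A" and a small, underrepresented SP "B"."""
    rng = random.Random(seed)
    records = []
    for sp, size in (("A", 200), ("B", 20)):
        for _ in range(size):
            y = rng.randint(0, 1)
            x = (rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0))
            records.append({"sp": sp, "x": x, "y": y})
    return records


def split_by_sp(records, sp_key):
    """Group records into subpopulations with a caller-supplied key function."""
    groups = defaultdict(list)
    for rec in records:
        groups[sp_key(rec)].append(rec)
    return dict(groups)


def train_val_split(rows, val_frac, rng):
    """Shuffle a copy of rows and hold out a fraction as the real validation set."""
    rows = rows[:]
    rng.shuffle(rows)
    n_val = max(1, int(len(rows) * val_frac))
    return rows[n_val:], rows[:n_val]


class GaussianStandIn:
    """Placeholder generator (NOT a GAN): per-label, per-feature Gaussian sampling."""

    def fit(self, rows):
        by_label = defaultdict(list)
        for x, y in rows:
            by_label[y].append(x)
        self.stats = {
            y: [(mean(col), pstdev(col)) for col in zip(*xs)]
            for y, xs in by_label.items()
        }
        self.labels = [y for _, y in rows]  # preserves the empirical label ratio
        return self

    def sample(self, n, rng):
        out = []
        for _ in range(n):
            y = rng.choice(self.labels)
            out.append((tuple(rng.gauss(m, s) for m, s in self.stats[y]), y))
        return out


class NearestCentroid:
    """Placeholder prediction model: assign the label of the closest class mean."""

    def fit(self, rows):
        by_label = defaultdict(list)
        for x, y in rows:
            by_label[y].append(x)
        self.centroids = {y: [mean(c) for c in zip(*xs)] for y, xs in by_label.items()}
        return self

    def predict(self, x):
        def dist(y):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[y]))
        return min(self.centroids, key=dist)


def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


def run_pipeline(records, sp_key, synth_counts, val_frac=0.3, seed=0):
    """Per SP: split, fit generator, augment, train, evaluate on real data only."""
    rng = random.Random(seed)
    results = {}
    for sp, recs in split_by_sp(records, sp_key).items():
        rows = [(r["x"], r["y"]) for r in recs]
        train, val = train_val_split(rows, val_frac, rng)  # assumes >= 2 rows per SP
        generator = GaussianStandIn().fit(train)
        for n in synth_counts:  # iterate over amounts of synthetic samples
            augmented = train + generator.sample(n, rng)
            model = NearestCentroid().fit(augmented)
            results[(sp, n)] = accuracy(model, val)
    return results
```

Calling `run_pipeline(make_toy_records(), sp_key=lambda r: r["sp"], synth_counts=[0, 50, 200])` returns one score per (SP, synthetic sample count) pair, which mirrors the final step of the caption: repeating the process for varying numbers of synthetic samples while always validating on real records.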
