A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care

Junyi Gao; Yinghao Zhu; Wenqing Wang; Yasha Wang; Wen Tang; Ewen M. Harrison; Liantao Ma

TL;DR

This work addresses the need for a fair, reproducible benchmark for COVID-19 ICU outcome prediction by introducing two clinically grounded tasks, outcome-specific length-of-stay (LOS) prediction and early mortality prediction, evaluated on two real-world ICU EHR datasets (TJH and CDSL). The authors design preprocessing pipelines, benchmark a diverse set of baselines including EHR-specific deep learning models, and propose two new metrics ($OSMAE$ and $ES$) along with a time-aware loss that encourages early and accurate risk signaling. They report that multi-task learning and time-aware optimization generally improve early and outcome-specific predictions, with notable performance differences between TJH and CDSL, and they provide an online platform that shares results and models to support clinical adoption. The benchmark enables practical, fair comparison of predictive methods for COVID-19 in ICUs and can guide future research and deployment in resource-constrained, time-critical settings.

Abstract

The COVID-19 pandemic has placed a heavy burden on healthcare systems worldwide and caused huge social disruption and economic loss. Many deep learning models have been proposed for clinical predictive tasks, such as mortality prediction for COVID-19 patients in intensive care units, using Electronic Health Record (EHR) data. Despite their initial success in certain clinical applications, there is currently a lack of benchmarking results that enable a fair comparison and the selection of the optimal model for clinical use. Furthermore, there is a discrepancy between the formulation of traditional prediction tasks and real-world clinical practice in intensive care. To fill these gaps, we propose two clinical prediction tasks, Outcome-specific length-of-stay prediction and Early mortality prediction, for COVID-19 patients in intensive care units. The two tasks are adapted from the naive length-of-stay and mortality prediction tasks to accommodate clinical practice for COVID-19 patients. We propose fair, detailed, open-source data-preprocessing pipelines and evaluate 17 state-of-the-art predictive models on the two tasks, including 5 machine learning models, 6 basic deep learning models and 6 deep learning predictive models specifically designed for EHR data. We provide fair, reproducible benchmarking results for both tasks using data from two real-world COVID-19 EHR datasets. One dataset is publicly available without any inquiry, and the other can be accessed on request. We deploy all experiment results and models on an online platform. We also allow clinicians and researchers to upload their data to the platform and get quick prediction results using our trained models. We hope our efforts can further facilitate deep learning and machine learning research for COVID-19 predictive modeling.

Paper Structure (28 sections, 5 equations, 17 figures, 19 tables)

Introduction
Dataset Description and Problem Formulation
EHR datasets for COVID-19 patients in intensive care
Problem formulation and evaluation metrics
Pipeline Design
Data preprocessing
Benchmarking experiment settings
Model adjustments for the proposed tasks
Results
Benchmarking performance of outcome-specific length-of-stay prediction
Benchmarking performance of early mortality prediction
Discussions
Analysis of early prediction performance
Case Study of Prediction Metrics
Outcome-specific prediction performance
...and 13 more sections

Figures (17)

Figure 1: Illustrations of the proposed OSMAE and ES metrics.
Figure 2: The K-fold cross-validation strategy. We take 4-fold as an example in the figure. We use a stratified shuffle split to ensure the proportions of alive and dead patients on all folds are the same as the total cohort.
Figure 3: Illustrations of the two-stage training and multi-task training settings.
Figure 4: Early prediction performance of 5 models with the highest ES on the CDSL dataset. All models are trained using the first half of patient records. Error bars are standard deviations. All performance improvements are statistically significant (p-value < 0.05).
Figure 5: AUROC of Dr. Agent and Dr. Agent-TA at each visit.
...and 12 more figures
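
The Figure 2 caption describes a K-fold cross-validation scheme with a stratified shuffle split, so that every fold keeps the same proportion of alive and dead patients as the total cohort. The following is a minimal sketch of that idea using only the Python standard library. It is not the authors' pipeline: the function name `stratified_kfold`, the seed, and the toy cohort are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=4, seed=0):
    """Split indices into k folds whose class proportions track the full cohort.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each outcome class
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy cohort (not real data): 1 = died, 0 = survived.
labels = [1] * 8 + [0] * 32
folds = stratified_kfold(labels, k=4, seed=42)

for i, test_idx in enumerate(folds):
    # Every fold serves once as the held-out set; the rest form the training set.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    rate = sum(labels[j] for j in test_idx) / len(test_idx)
    print(f"fold {i}: train={len(train_idx)}, test={len(test_idx)}, mortality rate={rate:.2f}")
```

In this toy cohort the overall mortality rate is 8/40 = 0.20, and each of the four folds holds 2 deceased and 8 surviving patients, so every fold reproduces that rate. With class counts that do not divide evenly by `k`, the per-fold rates would only approximate the cohort rate.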
