Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care

Juan Miguel Lopez Alcaraz; Hjalmar Bouma; Nils Strodthoff

TL;DR

The study tackles the challenge of building clinically useful AI in emergency care by introducing the open MDS-ED multimodal benchmark, which combines demographics, biometrics, vital trends, laboratory values, and raw ECG waveforms to predict a broad set of discharge diagnoses and deterioration events. It demonstrates that multimodal models, particularly those incorporating ECG waveforms via S4-based encoders, outperform unimodal baselines, achieving macro AUROCs of 0.8256 for diagnoses and 0.9115 for deterioration. The work provides a large, openly available dataset and rigorous benchmarking protocol, enabling reproducible evaluation and rapid progress in AI-driven ED decision support. It also discusses the clinical relevance, potential deployment considerations, and future directions for expanding data modalities, explainability, and prospective validation to move toward real-world adoption.

Abstract

Background: AI-driven prediction algorithms have the potential to enhance emergency medicine by enabling rapid and accurate decision-making regarding patient status and potential deterioration. However, the integration of multimodal data, including raw waveform signals, remains underexplored in clinical decision support. Methods: We present a dataset and benchmarking protocol designed to advance multimodal decision support in emergency care. Our models use demographics, biometrics, vital signs, laboratory values, and electrocardiogram (ECG) waveforms as inputs to predict both discharge diagnoses and patient deterioration. Results: The diagnostic model achieves area under the receiver operating characteristic curve (AUROC) scores above 0.8 for 609 out of 1,428 conditions, covering both cardiac (e.g., myocardial infarction) and non-cardiac (e.g., renal disease, diabetes) diagnoses. The deterioration model attains AUROC scores above 0.8 for 14 out of 15 targets, accurately predicting critical events such as cardiac arrest, mechanical ventilation, ICU admission, and mortality. Conclusions: Our study highlights the positive impact of incorporating raw waveform data into decision support models, improving predictive performance. By introducing a unique, publicly available dataset and baseline models, we provide a foundation for measurable progress in AI-driven decision support for emergency care.
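The abstract reports results as per-label AUROC scores and as the number of labels exceeding 0.8. As a minimal sketch of how such a summary could be computed, the snippet below derives AUROC from the rank-sum (Mann-Whitney U) identity and then aggregates across labels. The function names `auroc` and `summarize_labels`, the dictionary layout, and the toy labels are assumptions made for illustration; this is not the authors' evaluation code.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum identity; tied scores receive average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUROC is undefined when only one class is present
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def summarize_labels(truth, preds, threshold=0.8):
    """Return (macro AUROC, number of labels above threshold, per-label AUROCs)."""
    per_label = {name: auroc(truth[name], preds[name]) for name in truth}
    valid = [a for a in per_label.values() if a is not None]
    macro = sum(valid) / len(valid)
    n_above = sum(a > threshold for a in valid)
    return macro, n_above, per_label


# Toy example with two hypothetical labels (binary ground truth, model scores).
truth = {"label_a": [0, 0, 1, 1], "label_b": [0, 1]}
preds = {"label_a": [0.1, 0.4, 0.35, 0.8], "label_b": [0.2, 0.9]}
macro, n_above, per_label = summarize_labels(truth, preds)
```

Macro averaging weights every label equally, so rare diagnoses count as much as common ones; labels with only one class present in the evaluation split are skipped here because their AUROC is undefined.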

Paper Structure (27 sections, 2 figures, 11 tables)

Introduction
Methods
Clinical workflow and dataset creation
Prediction tasks and targets
Features
Train-test splits
Related work
Model architectures
Training and evaluation
Results
Benchmarking predictive performance
Task-dependent predictive performance
Discussion
Impact of data modalities
Clinical significance
...and 12 more sections

Figures (2)

Figure 1: The pipeline outlines the MDS-ED clinical workflow. Features are collected from patient demographics, biometrics (such as height, weight, and BMI), vital parameters and trends, laboratory values and trends, and ECG waveform data to address two clinically relevant prediction scenarios: predicting discharge diagnoses from 1,428 cardiac and non-cardiac ICD-10-CM codes, and predicting patient deterioration according to 15 clinical deterioration measures.
Figure 2: Schematic representation summarizing the creation process of the MDS-ED dataset (Lopez Alcaraz and Strodthoff, 2024) underlying this work. The process starts from four source datasets (MIMIC-IV-ECG, MIMIC-IV, MIMIC-IV-ED, and MIMIC-IV-ECG-ICD), from which we select patients aged 18 years or older for whom an ECG was collected within the first 90 minutes of arrival at the ED. The target values are collected at different intervals. On the resulting samples, we apply feature outlier removal, which is primarily error-based: it excludes unrealistic values, never-registered extremes, and negative values when the minimum is zero; see the dataset appendix and Lopez Alcaraz and Strodthoff (2024) for details. We then create imputation masks: missing values are imputed with medians computed on the training set, which are also applied to the validation and test sets, and binary masks indicating imputed values are added to help the model learn missingness patterns. Finally, we capture trends through engineered features such as summary statistics (mean, median, min, max, standard deviation), first and last values, rate of change, and the slope of a linear model fitted to the values within the first 90 minutes after arrival. The resulting MDS-ED dataset comprises 1,428 diagnostic labels and 15 deterioration labels across 129,057 samples from 71,098 patients, collected from 121,195 unique visits. The input features cover a single 10 s, 12-lead ECG in addition to 470 tabular features (excluding binary masking columns).
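
The preprocessing steps named in the Figure 2 caption (median imputation fitted on the training split, binary imputation masks, and trend features over the first 90 minutes) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the released MDS-ED pipeline: the function names, the `_imputed` suffix, the use of `None` for missing values, the population standard deviation, and the definition of rate of change as (last - first) / elapsed time are all choices made here for the example.

```python
from statistics import mean, median, pstdev


def fit_medians(train_rows):
    """Per-column medians from the training split only (None marks missing)."""
    columns = train_rows[0].keys()
    return {
        col: median(r[col] for r in train_rows if r[col] is not None)
        for col in columns
    }


def impute_with_mask(rows, medians):
    """Fill missing values with training medians and add a 0/1 mask column."""
    out = []
    for r in rows:
        new = {}
        for col, fill in medians.items():
            missing = r[col] is None
            new[col] = fill if missing else r[col]
            new[col + "_imputed"] = int(missing)  # hypothetical mask naming
        out.append(new)
    return out


def trend_features(times_min, values):
    """Summary statistics, first/last value, rate of change, and OLS slope."""
    t_bar, v_bar = mean(times_min), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times_min)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times_min, values))
    span = times_min[-1] - times_min[0]
    return {
        "mean": v_bar,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),  # population standard deviation (assumption)
        "first": values[0],
        "last": values[-1],
        "rate_of_change": (values[-1] - values[0]) / span if span else 0.0,
        "slope": sxy / sxx if sxx else 0.0,  # least-squares slope per minute
    }


# Toy data: heart rate and lactate, with one missing value in each split.
train = [
    {"hr": 80.0, "lactate": 1.0},
    {"hr": 100.0, "lactate": None},
    {"hr": 90.0, "lactate": 3.0},
]
test = [{"hr": None, "lactate": 2.5}]

medians = fit_medians(train)
test_imputed = impute_with_mask(test, medians)
hr_trend = trend_features([0, 30, 60], [80.0, 90.0, 100.0])
```

Fitting the medians on the training split alone keeps validation and test statistics out of preprocessing, and the mask columns let a model distinguish a measured value from an imputed one. A column with no observed training values would need separate handling, since `median` raises an error on empty input.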
