Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML

Robin van de Water; Hendrik Schmidt; Paul Elbers; Patrick Thoral; Bert Arnrich; Patrick Rockenschaub

TL;DR

The paper introduces Yet Another ICU Benchmark (YAIB), a modular, open-source framework for reproducible, cross-dataset clinical ML research in intensive care. By harmonizing data with the ricu clinical-concept approach and supporting the full workflow from cohort definition to model evaluation, YAIB enables like-for-like comparisons across multiple public ICU datasets (MIMIC-III/IV, eICU, HiRID, AUMCdb) and predefined prediction tasks. Empirical results show that dataset choice and preprocessing decisions often affect performance more than model class does, underscoring the need for holistic benchmarking and external validation. The framework also supports transfer learning and domain adaptation, and it is designed to be extended with new datasets and tasks, promoting rapid, reproducible method development in clinical ML.

Abstract

Medical applications of machine learning (ML) have experienced a surge in popularity in recent years. The intensive care unit (ICU) is a natural habitat for ML given the abundance of available data from electronic health records. Models have been proposed to address numerous ICU prediction tasks like the early detection of complications. While authors frequently report state-of-the-art performance, it is challenging to verify claims of superiority. Datasets and code are not always published, and cohort definitions, preprocessing pipelines, and training setups are difficult to reproduce. This work introduces Yet Another ICU Benchmark (YAIB), a modular framework that allows researchers to define reproducible and comparable clinical ML experiments; we offer an end-to-end solution from cohort definition to model evaluation. The framework natively supports most open-access ICU datasets (MIMIC-III/IV, eICU, HiRID, AUMCdb) and is easily adaptable to future ICU datasets. Combined with a transparent preprocessing pipeline and extensible training code for multiple ML and deep learning models, YAIB enables unified model development. Our benchmark comes with five predefined, established prediction tasks (mortality, acute kidney injury, sepsis, kidney function, and length of stay) developed in collaboration with clinicians. Adding further tasks is straightforward by design. Using YAIB, we demonstrate that the choice of dataset, cohort definition, and preprocessing have a major impact on the prediction performance, often more so than model class, indicating an urgent need for YAIB as a holistic benchmarking tool. We provide our work to the clinical ML community to accelerate method development and enable real-world clinical implementations. Software Repository: https://github.com/rvandewater/YAIB.

Paper Structure (74 sections, 2 equations, 11 figures, 29 tables)

Introduction
Related work
Benchmark design
Design philosophy
Clinical concepts
Patient cohort and task definition
Preprocessing and feature extraction
Training and evaluation
Experiments
Models and experimental setup
Benchmarking baseline models on major ICU datasets
Using YAIB as an experimental ML framework
Transfer learning
Discussion
Conclusion
...and 59 more sections

Figures (11)

Figure 1: Schematic overview of benchmark pipeline. On the left side, the creation of harmonized ICU cohorts is shown. Note that the domain expertise of clinicians is often necessary for defining clinically useful tasks. The schematic overview of the benchmark stages can be found on the right. Note that the dotted line indicates that this component can be easily extended, as it follows an abstracted interface.
Figure 2: Performance of prediction models when trained on one dataset (row) and evaluated on all others (columns). Left: Performance in AUROC of the GRU model on ICU mortality. Right: Performance in AUPRC for the same models. Pooled (d-1) refers to training a model on every dataset except the evaluation dataset.
Figure 3: Fine-tuning an eICU model for ICU mortality prediction on HiRID.
Figure 4: Performance in MAE of the Transformer model on length of stay (LoS).
Figure 5: AUPRC for fine-tuning an eICU GRU model for ICU mortality prediction on HiRID.
...and 6 more figures
