ICU-Sepsis: A Benchmark MDP Built from Real Medical Data

Kartik Choudhary; Dhawal Gupta; Philip S. Thomas

TL;DR

ICU-Sepsis addresses the need for a standardized, real-data-derived benchmark to evaluate reinforcement learning methods for sepsis management in the ICU. It builds a tabular MDP from the MIMIC-III dataset with $|\mathcal{S}|=716$ states and $|\mathcal{A}^+|=25$ actions, a discount factor $\gamma=1$, and a reward of $R=+1$ given only when the patient survives at the end of an episode. Downloadable CSV dynamics and Gym-compatible code make benchmarking reproducible. The work demonstrates the approach by comparing multiple RL algorithms (e.g., Sarsa, Q-Learning, DQN, SAC, PPO) and shows that convergence requires hundreds of thousands of episodes: some methods approach near-optimal performance while others lag, highlighting both the potential and the challenges of RL in realistic healthcare benchmarks. By providing a lightweight, privacy-preserving, and broadly compatible environment, ICU-Sepsis offers researchers a practical tool for comparing RL algorithms on an important real-world problem, while explicitly avoiding clinical prescriptions or guidance for patient care.

Abstract

We present ICU-Sepsis, an environment that can be used in benchmarks for evaluating reinforcement learning (RL) algorithms. Sepsis management is a complex task that has been an important topic in applied RL research in recent years. Therefore, MDPs that model sepsis management can serve as part of a benchmark to evaluate RL algorithms on a challenging real-world problem. However, creating usable MDPs that simulate sepsis care in the ICU remains a challenge due to the complexities involved in acquiring and processing patient data. ICU-Sepsis is a lightweight environment that models personalized care of sepsis patients in the ICU. The environment is a tabular MDP that is widely compatible and is challenging even for state-of-the-art RL algorithms, making it a valuable tool for benchmarking their performance. However, we emphasize that while ICU-Sepsis provides a standardized environment for evaluating RL algorithms, it should not be used to draw conclusions that guide medical practice.
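To make the structure described above concrete, here is a minimal sketch of a tabular, episodic MDP with a discount factor of 1 and a reward of +1 only on survival. It is not the ICU-Sepsis environment or its API: the five states, two actions, and every transition probability below are made up for illustration, and the names `P`, `step`, `run_episode`, and `optimal_values` are hypothetical.

```python
import random

# Hypothetical toy MDP (NOT the ICU-Sepsis dynamics): three transient
# states plus two absorbing terminal states.
SURVIVE, DEATH = 3, 4
N_STATES = 5

# P[s][a] is a list of (next_state, probability) pairs. All numbers are
# invented for illustration.
P = {
    0: {0: [(1, 0.7), (DEATH, 0.3)], 1: [(1, 0.5), (2, 0.5)]},
    1: {0: [(SURVIVE, 0.6), (DEATH, 0.4)], 1: [(2, 0.9), (DEATH, 0.1)]},
    2: {0: [(SURVIVE, 0.8), (DEATH, 0.2)], 1: [(SURVIVE, 0.5), (DEATH, 0.5)]},
}


def reward_for(next_state):
    """+1 when the episode ends in survival, 0 otherwise."""
    return 1.0 if next_state == SURVIVE else 0.0


def step(state, action, rng):
    """Sample one transition and return (next_state, reward, done)."""
    next_states, probs = zip(*P[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    done = next_state in (SURVIVE, DEATH)
    return next_state, reward_for(next_state), done


def run_episode(policy, rng, start_state=0):
    """Roll out one episode and return the undiscounted return (gamma = 1)."""
    state, total, done = start_state, 0.0, False
    while not done:
        state, reward, done = step(state, policy(state), rng)
        total += reward
    return total


def optimal_values(n_sweeps=50):
    """Value iteration with gamma = 1; terminal states keep value 0."""
    V = [0.0] * N_STATES
    for _ in range(n_sweeps):
        for s in P:
            V[s] = max(
                sum(p * (reward_for(s2) + V[s2]) for s2, p in P[s][a])
                for a in P[s]
            )
    return V
```

Because the only reward is +1 at a surviving terminal state and there is no discounting, the return of an episode is either 0 or 1, and the value of a state under a policy equals the probability of survival from that state. In a tabular setting like this one, `optimal_values()` gives the best achievable survival probability, which is the kind of reference point that learning curves can be compared against.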

Paper Structure (28 sections, 4 equations, 6 figures, 11 tables)

Introduction
Background
Technical setting
Sepsis management
RL for sepsis treatment
Software and Data
The environment parameters and implementation
The ICU-Sepsis Environment
Formulating sepsis management as a reinforcement learning problem
The ICU-Sepsis dataset
Constructing the ICU-Sepsis MDP
Computing the final parameters
Additional environment details
Experiments
Methodology
...and 13 more sections

Figures (6)

Figure 1: Illustration of one episode in the ICU-Sepsis environment. The clinician treats the patient through actions, which affect how their state evolves over time, until the patient is discharged (and a positive reward is received), or the patient dies (and no reward is received).
Figure 2: (Left) The learning curves for five algorithms on the ICU-Sepsis MDP. (Right) Average episode lengths during the learning process. Each curve is averaged over $1,\!000$ random seeds, where the error bars represent one unit of standard error.
Figure 3: Distribution of the number of admissible actions for different states in the ICU-Sepsis environment.
Figure 4: (Left) The learning curves for five algorithms on the Variant MDP. (Right) A plot depicting the average episode lengths during the learning process. Each curve is averaged over $20$ random seeds, where the error bars represent one unit of standard error.
Figure 5: Illustration of the perturbation process. (a) Admissible actions for different states. Each row has a state (in bold) followed by the list of admissible actions in that state. (b) Some admissible actions are randomly chosen and made inadmissible. (c) Remaining admissible actions. This can cause some states (in this case $\text{S}_3$) to have no admissible actions left. (d) For states where there are no admissible actions left, a previously admissible action is chosen and reintroduced as an admissible action. Thus, every state still has at least one admissible action after the perturbation process.
...and 1 more figure
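The perturbation process described in the Figure 5 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption does not say how the removed actions are selected, so the sketch assumes each admissible action is dropped independently with probability `drop_prob`, and that the reintroduced action is drawn uniformly from the state's original admissible actions. The function name `perturb_admissible` and its parameters are hypothetical.

```python
import random


def perturb_admissible(admissible, drop_prob, rng):
    """Randomly remove admissible actions while keeping at least one per state.

    admissible: dict mapping each state to a non-empty list of admissible actions.
    drop_prob:  assumed per-action removal probability (not from the paper).
    rng:        a random.Random instance, for reproducibility.
    """
    perturbed = {}
    for state, actions in admissible.items():
        # Steps (b) and (c): drop some actions at random, keep the rest.
        kept = [a for a in actions if rng.random() >= drop_prob]
        # Step (d): if nothing is left, reintroduce one previously
        # admissible action so the state is never left without a choice.
        if not kept:
            kept = [rng.choice(actions)]
        perturbed[state] = kept
    return perturbed


if __name__ == "__main__":
    example = {"S1": [0, 1, 2], "S2": [1, 4], "S3": [3]}
    print(perturb_admissible(example, drop_prob=0.5, rng=random.Random(0)))
```

Whatever the selection rule, the invariant from the caption holds here: every state ends with a non-empty subset of its original admissible actions.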
