ICU-Sepsis: A Benchmark MDP Built from Real Medical Data
Kartik Choudhary, Dhawal Gupta, Philip S. Thomas
TL;DR
ICU-Sepsis addresses the need for a standardized benchmark, derived from real data, for evaluating reinforcement learning methods on sepsis management in the ICU. It builds a tabular MDP from the MIMIC-III dataset with $|S|=716$ states and $|A^+|=25$ actions, a discount factor $\gamma=1$, and a terminal reward $R=+1$ for patient survival, and it supports reproducible benchmarking through downloadable CSV dynamics and Gym-compatible code. The work demonstrates the environment by comparing multiple RL algorithms (e.g., Sarsa, Q-Learning, DQN, SAC, PPO) and shows that convergence requires hundreds of thousands of episodes, with some methods reaching near-optimal performance while others lag, highlighting both the potential and the challenges of RL in realistic healthcare benchmarks. By providing a lightweight, privacy-preserving, and broadly compatible environment, ICU-Sepsis offers researchers a practical tool for comparing RL algorithms on an important real-world problem, while explicitly avoiding clinical prescriptions or guidance for patient care.
Abstract
We present ICU-Sepsis, an environment that can be used in benchmarks for evaluating reinforcement learning (RL) algorithms. Sepsis management is a complex task that has been an important topic in applied RL research in recent years. Therefore, MDPs that model sepsis management can serve as part of a benchmark to evaluate RL algorithms on a challenging real-world problem. However, creating usable MDPs that simulate sepsis care in the ICU remains a challenge due to the complexities involved in acquiring and processing patient data. ICU-Sepsis is a lightweight environment that models personalized care of sepsis patients in the ICU. The environment is a tabular MDP that is widely compatible and is challenging even for state-of-the-art RL algorithms, making it a valuable tool for benchmarking their performance. However, we emphasize that while ICU-Sepsis provides a standardized environment for evaluating RL algorithms, it should not be used to draw conclusions that guide medical practice.
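To make the phrase "tabular MDP" concrete, the following is a minimal sketch of tabular Q-learning on an MDP with the same shape as the one summarized above: 716 states, 25 actions, no discounting, and a reward of +1 only on reaching a survival state. It does not use the ICU-Sepsis package or its CSV files. The transition table is random placeholder data, and the function names (`make_synthetic_mdp`, `q_learning`), the choice of two terminal states, the five successors per state-action pair, and the hyperparameters are all assumptions made for illustration.

```python
import random

N_STATES = 716   # state count quoted in the summary above
N_ACTIONS = 25   # action count quoted in the summary above
GAMMA = 1.0      # undiscounted, as in the summary above


def make_synthetic_mdp(rng, n_states=N_STATES, n_actions=N_ACTIONS, k=5):
    """Build random placeholder dynamics (NOT the real ICU-Sepsis tables).

    Assumption for illustration: the last two state indices are absorbing
    "death" and "survival" states, and each (state, action) pair has k
    possible successors.
    """
    death, survival = n_states - 2, n_states - 1
    succ, prob = [], []
    for _ in range(n_states):
        succ_s, prob_s = [], []
        for _ in range(n_actions):
            # The first successor is always terminal so every episode can end.
            nxt = [rng.choice((death, survival))]
            nxt += [rng.randrange(n_states - 2) for _ in range(k - 1)]
            weights = [rng.random() + 1e-6 for _ in range(k)]
            total = sum(weights)
            succ_s.append(nxt)
            prob_s.append([w / total for w in weights])
        succ.append(succ_s)
        prob.append(prob_s)
    return succ, prob, {death, survival}, survival


def q_learning(succ, prob, terminal, survival, episodes, rng,
               alpha=0.1, epsilon=0.1, max_steps=200):
    """Epsilon-greedy tabular Q-learning; returns the Q-table and episode returns."""
    n_states, n_actions = len(succ), len(succ[0])
    start_states = [s for s in range(n_states) if s not in terminal]
    q = [[0.0] * n_actions for _ in range(n_states)]
    returns = []
    for _ in range(episodes):
        s = rng.choice(start_states)
        g = 0.0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=q[s].__getitem__)
            s_next = rng.choices(succ[s][a], weights=prob[s][a])[0]
            # Reward is +1 only on entering the survival state, 0 otherwise.
            r = 1.0 if s_next == survival else 0.0
            bootstrap = 0.0 if s_next in terminal else GAMMA * max(q[s_next])
            q[s][a] += alpha * (r + bootstrap - q[s][a])
            g += r
            s = s_next
            if s in terminal:
                break
        returns.append(g)
    return q, returns


if __name__ == "__main__":
    rng = random.Random(0)
    succ, prob, terminal, survival = make_synthetic_mdp(rng)
    _, returns = q_learning(succ, prob, terminal, survival, episodes=1000, rng=rng)
    print("mean return over 1000 episodes:", sum(returns) / len(returns))
```

Because the reward is a single +1 on survival and the return is undiscounted, each episode's return is either 0 or 1, so the mean return can be read as an estimated survival rate under the learned policy in this toy model. Numbers produced by this sketch say nothing about the real environment, since the dynamics here are random.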
