Exploring Time-Step Size in Reinforcement Learning for Sepsis Treatment

Yingchuan Sun, Shengpu Tang

TL;DR

This study investigates how the time-step size $\Delta t$ affects offline RL for sepsis treatment using MIMIC-III data, evaluating $\Delta t$ values of 1, 2, 4, and 8 hours within an identical pipeline. It introduces cross-$\Delta t$ policy evaluation via action mapping and compares two BCQ architectures, finding that finer time steps (1–2 h) combined with a static behavior policy yield the most stable and best-performing policies. The results indicate that $\Delta t$ influences state representation learning, behavior cloning, policy training, and off-policy evaluation, so time-step size should be treated as a core design choice rather than fixed at the conventional 4 hours. The action-mapping procedure also gives a way to compare policies fairly across time-step sizes in healthcare RL.

Abstract

Existing studies on reinforcement learning (RL) for sepsis management have mostly followed an established problem setup, in which patient data are aggregated into 4-hour time steps. Although concerns have been raised regarding the coarseness of this time-step size, which might distort patient dynamics and lead to suboptimal treatment policies, the extent to which this is a problem in practice remains unexplored. In this work, we conducted empirical experiments for a controlled comparison of four time-step sizes ($\Delta t = 1, 2, 4, 8$ h) in this domain, following an identical offline RL pipeline. To enable a fair comparison across time-step sizes, we designed action re-mapping methods that allow policies to be evaluated on datasets with different time-step sizes, and performed cross-$\Delta t$ model selection under two policy learning setups. Our goal was to quantify how time-step size influences state representation learning, behavior cloning, policy training, and off-policy evaluation. Our results show that performance trends across $\Delta t$ vary as learning setups change, while policies learned at finer time-step sizes ($\Delta t = 1$ h and $2$ h) using a static behavior policy achieve the best overall performance and stability. Our work highlights time-step size as a core design choice in offline RL for healthcare and provides evidence supporting alternatives beyond the conventional 4-hour setup.
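To make the notion of aggregating patient data into $\Delta t$-hour time steps concrete, the following is a minimal sketch rather than the authors' preprocessing. It assumes a single vital sign recorded as `(hours_since_admission, value)` pairs and averages the measurements falling in each window; the function name `aggregate_to_steps`, the mean aggregation, and the example heart-rate values are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_steps(records, dt_hours):
    """Group (time_in_hours, value) records into windows of dt_hours.

    Returns a dict mapping step index -> mean of the values in that window.
    Windows without any measurement are simply absent (no imputation here).
    Mean aggregation is an assumption made for this sketch.
    """
    if dt_hours <= 0:
        raise ValueError("dt_hours must be positive")
    buckets = defaultdict(list)
    for t, value in records:
        # Step k covers the half-open interval [k * dt, (k + 1) * dt).
        buckets[int(t // dt_hours)].append(value)
    return {step: mean(vals) for step, vals in sorted(buckets.items())}


# Hypothetical heart-rate measurements: (hours since admission, beats/min).
heart_rate = [(0.5, 90.0), (1.5, 100.0), (2.5, 110.0), (5.0, 80.0)]
steps_1h = aggregate_to_steps(heart_rate, 1)
steps_4h = aggregate_to_steps(heart_rate, 4)
```

With these made-up values, the 1-hour view keeps the rise from 90 to 110 as three separate steps, whereas the 4-hour view collapses it into a single averaged step. This is the kind of smoothing that motivates the concern about coarse time steps.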

Paper Structure

This paper contains 21 sections, 4 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Overview of the offline RL pipeline.
  • Figure 2: Pareto frontiers of kNN-policy performance (WIS vs. ESS) across evaluation time-step sizes $t_D$. Each curve corresponds to policies trained at a given $t_\pi$; hollow markers denote the model selected for testing; dotted lines of different colors mark the thresholds used as model-selection boundaries across $\Delta t$.
  • Figure 3: The illustration of our cross-$\Delta t$ mapping.
  • Figure : $t_D=1\,\mathrm{h}$ dataset
  • Figure : $t_D=1\,\mathrm{h}$
  • ...and 5 more figures
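
The WIS and ESS axes in Figure 2 refer to weighted importance sampling and effective sample size. The sketch below uses the standard trajectory-wise WIS estimator and the Kish definition of ESS; the paper's exact variants (for example, weight clipping or policy softening) are not stated here, so treat this as an illustration under those assumptions. The function names and the dictionary layout of each trajectory are hypothetical.

```python
def trajectory_weight(pi_e_probs, pi_b_probs):
    """Cumulative importance ratio: product of pi_e(a|s) / pi_b(a|s) over steps.

    Assumes every behavior probability is strictly positive.
    """
    w = 1.0
    for pe, pb in zip(pi_e_probs, pi_b_probs):
        w *= pe / pb
    return w


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def wis_and_ess(trajectories, gamma=1.0):
    """Return (WIS value estimate, Kish effective sample size).

    Each trajectory is a dict with keys "pi_e", "pi_b" (per-step action
    probabilities under the evaluation and behavior policies) and "rewards".
    """
    weights = [trajectory_weight(tr["pi_e"], tr["pi_b"]) for tr in trajectories]
    returns = [discounted_return(tr["rewards"], gamma) for tr in trajectories]
    total = sum(weights)
    if total == 0.0:
        # The evaluation policy never agrees with the data: no usable estimate.
        return float("nan"), 0.0
    wis = sum(w * g for w, g in zip(weights, returns)) / total
    ess = total ** 2 / sum(w * w for w in weights)
    return wis, ess


# Two made-up trajectories with terminal rewards of +1 and -1.
example = [
    {"pi_e": [0.5, 0.5], "pi_b": [0.5, 0.5], "rewards": [0.0, 1.0]},
    {"pi_e": [1.0, 0.5], "pi_b": [0.5, 0.5], "rewards": [0.0, -1.0]},
]
wis_value, ess_value = wis_and_ess(example)
```

A policy that deviates strongly from the behavior policy concentrates weight on a few trajectories, which lowers ESS and makes the WIS estimate less reliable. That is why the two quantities are read together as a trade-off.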