The hidden risks of temporal resampling in clinical reinforcement learning

Thomas Frost, Hrisheekesh Vaidya, Steve Harris

TL;DR

This work investigates how temporal resampling of clinical data, a common preprocessing step in offline reinforcement learning, can cause distributional shifts that degrade live performance. By simulating irregular decision intervals with LavaGap and the UVA/Padova diabetes simulator, the authors show that training on binned or interpolated data yields substantially worse policies than training on unprocessed, irregular data, and that off-policy evaluation can overestimate performance on post-processed datasets. They identify three failure mechanisms: counterfactual trajectories, distorted temporal expectations, and accumulated generalisation errors, and demonstrate that preserving irregular timing via semi-Markov decision processes improves alignment with real clinical dynamics. The study advocates moving beyond default binning practices toward RL frameworks that explicitly handle irregular decision-making, highlighting important implications for the deployment safety of healthcare ORL systems.
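One way to picture what "preserving irregular timing" means in a semi-Markov setting is the textbook SMDP Q-learning target, in which the discount factor is raised to the time elapsed between decisions. The sketch below is an illustration under that standard formulation, not the authors' implementation; the function name `smdp_td_target`, its parameters, and the numbers are hypothetical, and the reward is assumed to be already accumulated over the interval.

```python
def smdp_td_target(reward, next_q_values, elapsed, gamma=0.99, terminal=False):
    """One-step Q-learning target with a time-aware discount.

    reward        : reward accumulated over the interval (assumed precomputed)
    next_q_values : Q estimates for each action in the next state
    elapsed       : time between this decision and the next, in the same
                    units that gamma is defined per
    """
    if terminal:
        return reward
    # A regular-interval MDP would use gamma ** 1 for every transition.
    return reward + (gamma ** elapsed) * max(next_q_values)


# Hypothetical values: the same next state is worth less after a longer gap.
after_one = smdp_td_target(1.0, [0.0, 10.0], elapsed=1, gamma=0.9)
after_three = smdp_td_target(1.0, [0.0, 10.0], elapsed=3, gamma=0.9)
print(after_one, after_three)
```

Setting `elapsed=1` for every transition recovers the usual fixed-interval target, which is the assumption that resampled datasets implicitly make.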

Abstract

Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals, simplifying their mapping to standard ORL frameworks. The impact of these temporal manipulations on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of offline reinforcement learning algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.
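To make "aggregating patient data into fixed time intervals" concrete, here is a minimal sketch of temporal binning on a toy irregular trajectory. The aggregation rule (mean of observations, sum of actions per bin), the function name `bin_trajectory`, and all values are assumptions chosen for illustration; the paper's actual preprocessing may differ.

```python
from collections import defaultdict


def bin_trajectory(events, bin_width):
    """Aggregate irregular (time, observation, action) events into fixed bins.

    Observations are averaged and actions summed within each bin. This is an
    assumed convention for illustration only.
    """
    bins = defaultdict(list)
    for t, obs, act in events:
        bins[int(t // bin_width)].append((obs, act))

    binned = []
    for idx in sorted(bins):
        obs_vals = [o for o, _ in bins[idx]]
        act_vals = [a for _, a in bins[idx]]
        binned.append(
            (idx * bin_width, sum(obs_vals) / len(obs_vals), sum(act_vals))
        )
    return binned


# Hypothetical events: (minutes, glucose in mg/dL, insulin units).
events = [
    (0, 200.0, 0.0),
    (10, 200.0, 3.0),   # dose chosen while glucose reads 200
    (50, 140.0, 0.0),   # reading taken after the dose has started to act
    (75, 120.0, 0.0),
    (130, 110.0, 0.0),
]

binned = bin_trajectory(events, bin_width=60)
for row in binned:
    print(row)
```

In the first bin the dose is paired with a glucose value of 180, even though the decision was made at 200: the binned state blends readings taken before and after the action. A policy trained on such rows sees state-action pairs that never occurred in that form, which is one simple way a resampled dataset can diverge from the original decision process.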

Paper Structure (21 sections, 7 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: LavaGap environment. To reach the green goal square, the agent (red triangle) must navigate the grid without contacting the lava (orange squares). The model uses partial observations (light grey) to calculate the optimal path. The episode ends with a reward of 1 if the agent reaches the goal, or 0 if it hits lava or exceeds the maximum number of steps.
  • Figure 2: Example patient trajectory in the UVA/Padova simulator, before (top panel) and after (bottom panel) temporal binning. Background shading indicates the reward landscape, with positive (green) rewards in the target glycaemic range and negative (red) rewards for hypo- or hyperglycaemia. An example of a counterfactual trajectory that reverses cause and effect is shown circled in the lower panel.
  • Figure 3: Impact of temporal resampling on offline RL performance. Agents trained via behavioural cloning (BC), implicit Q-learning (IQL), or conservative Q-learning (CQL) were evaluated on (a) the discrete LavaGap navigation task and (b) the continuous UVA/Padova insulin control task. Models were trained using unprocessed, interpolated, or temporally binned datasets and deployed in both regular and irregular versions of the environment. In both domains, agents trained on the unprocessed dataset consistently achieved the highest returns, whereas binned and interpolated datasets led to significant performance degradation. The pink band indicates the expert proximal policy optimisation (PPO) baseline used to generate the training data. For UVA/Padova, average returns are normalised (0.0 for a random policy; 1.0 for the highest observed score). Shaded regions and error bars represent 95% confidence intervals (CIs).
  • Figure 4: Calibration plot showing reliability of off-policy evaluation across different types of dataset preprocessing. The plot compares the true online performance of trained agents in the UVA/Padova environment against the performance predicted by fitted Q-evaluation (FQE). Performance is normalised such that 0.0 represents a random policy and 1.0 represents the dataset's behaviour policy. While agents trained on unprocessed (blue) and interpolated (orange) data show high calibration (clustering near the diagonal), those trained on temporally binned data (green, red) exhibit severe overestimation bias. Error bars represent 95% confidence intervals.
  • Figure S1: Extension of Figure 2, showing the full reward landscape for all permissible glucose levels in the UVA/Padova T1DM simulator.
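
The captions for Figures 3 and 4 describe returns rescaled so that 0.0 corresponds to a random policy and 1.0 to a reference score (the highest observed score in Figure 3, the dataset's behaviour policy in Figure 4). A linear rescaling consistent with that description can be sketched as follows; the function name `normalise_return` and the example numbers are hypothetical, and the paper's exact procedure is not shown here.

```python
def normalise_return(raw, random_return, reference_return):
    """Linearly rescale a return so random -> 0.0 and reference -> 1.0."""
    if reference_return == random_return:
        raise ValueError("reference and random returns must differ")
    return (raw - random_return) / (reference_return - random_return)


# Hypothetical raw returns on an arbitrary scale.
random_return = -100.0
reference_return = 0.0

print(normalise_return(-50.0, random_return, reference_return))   # halfway
print(normalise_return(-100.0, random_return, reference_return))  # random level
print(normalise_return(0.0, random_return, reference_return))     # reference level
```

On this scale a value above 1.0 means the agent exceeded the reference, and a value below 0.0 means it did worse than random.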