The hidden risks of temporal resampling in clinical reinforcement learning
Thomas Frost, Hrisheekesh Vaidya, Steve Harris
TL;DR
This work investigates how temporal resampling of clinical data, a common preprocessing step in offline reinforcement learning (ORL), can cause distributional shifts that degrade performance at deployment. By simulating irregular decision intervals in the LavaGap gridworld and the UVA/Padova diabetes simulator, the authors show that training on binned or interpolated data yields substantially worse policies than training on the unprocessed, irregular data, and that off-policy evaluation can overestimate the performance of policies trained on resampled datasets. They identify three failure mechanisms (counterfactual trajectories, distorted temporal expectations, and accumulated generalisation errors) and demonstrate that preserving irregular timing via semi-Markov decision processes improves alignment with real clinical dynamics. The study advocates moving beyond default binning practices toward RL frameworks that explicitly handle irregular decision-making, with direct implications for the deployment safety of healthcare ORL systems.
Abstract
Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals so that they map more easily onto standard ORL frameworks. The impact of this temporal resampling on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of ORL algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.
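To make the first mechanism concrete, the following minimal sketch shows how resampling an irregular log onto a fixed grid can manufacture decision points that were never recorded. It is an illustration under stated assumptions, not the authors' pipeline: the record layout, the values, the 30-minute bin width, and the forward-fill rule in the hypothetical helper `resample_forward_fill` are all invented for this example.

```python
from bisect import bisect_right

# Hypothetical irregular log (values invented for illustration):
# (time in minutes, glucose reading, insulin dose chosen at that time)
records = [
    (0, 180.0, 2.0),
    (7, 172.0, 0.0),
    (95, 110.0, 1.0),
]


def resample_forward_fill(records, bin_minutes, horizon):
    """Resample an irregular log onto a fixed grid by carrying the most
    recent record forward (one common choice; interpolation is another)."""
    times = [t for t, _, _ in records]
    grid = []
    for t in range(0, horizon + 1, bin_minutes):
        # Index of the latest record taken at or before grid time t.
        idx = bisect_right(times, t) - 1
        src_t, obs, action = records[idx]
        grid.append({
            "t": t,
            "obs": obs,
            "action": action,
            # True only if a record was really logged at this exact time.
            "observed": src_t == t,
        })
    return grid


grid = resample_forward_fill(records, bin_minutes=30, horizon=120)
for row in grid:
    print(row)

synthetic = sum(not row["observed"] for row in grid)
print(f"{synthetic} of {len(grid)} grid rows were never actually recorded")
```

In this toy log three decisions were recorded, yet the resampled dataset presents five equally spaced decision points, four of which repeat a stale observation and action. A learner trained on the grid would treat those rows as genuine transitions, which is the kind of counterfactual trajectory the abstract refers to.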
