Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies
Runze Yan, Xun Shen, Akifumi Wachi, Sebastien Gros, Anni Zhao, Xiao Hu
TL;DR
This paper tackles the risk of unsafe generalization in offline RL for medical treatment by proposing Offline Guarded Safe Reinforcement Learning (OGSRL), a model-based offline framework with dual constraints: an OOD guardian that confines learning to clinically supported state-action regions and a safety cost constraint ensuring physiological safety. The guardian uses a polynomial sublevel set (PSoS) to approximate the in-distribution region with provable high-probability containment, while the policy optimization operates on an Estimated CMDP (E-CMDP) augmented with an OOD cost bound. The authors provide finite-sample safety and near-optimality guarantees that connect dataset size, model errors, and planning horizon to performance bounds, and validate the approach on real sepsis data, showing guardian-enhanced policies achieve superior clinician alignment, safety, and outcomes (notably, GMB-CPO achieves ME = 0.0138, a substantial improvement over SOC). The work advances safe offline medical RL by integrating distributional safeguards with domain-specific safety constraints, illustrating practical potential for decision-support systems that improve patient outcomes without extrapolating beyond observed data.
Abstract
When offline reinforcement learning (RL) is applied in healthcare, out-of-distribution (OOD) issues pose significant risks: inappropriate generalization beyond clinical expertise can produce potentially harmful recommendations. Existing methods such as conservative Q-learning (CQL) attempt to address the OOD issue, but their effectiveness is limited because they constrain only action selection, by suppressing uncertain actions. This action-only regularization imitates clinician actions that prioritize short-term rewards, but it fails to regulate downstream state trajectories, thereby limiting the discovery of improved long-term treatment strategies. To safely improve the policy beyond clinician recommendations while ensuring that state-action trajectories remain in-distribution, we propose \textit{Offline Guarded Safe Reinforcement Learning} ($\mathsf{OGSRL}$), a theoretically grounded model-based offline RL framework. $\mathsf{OGSRL}$ introduces a novel dual constraint mechanism for reliable and safe policy improvement. First, an OOD guardian specifies clinically validated regions for safe policy exploration. By constraining optimization to these regions, it enables the reliable exploration of treatment strategies that outperform clinician behavior by leveraging the full patient state history, without drifting into unsupported state-action trajectories. Second, a safety cost constraint encodes medical knowledge about physiological safety boundaries, providing domain-specific safeguards even in areas where the training data might contain potentially unsafe interventions. Notably, we provide theoretical guarantees on safety and near-optimality: policies that satisfy these constraints remain in safe and reliable regions and achieve performance close to that of the best possible policy supported by the data.
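As a rough illustration of the dual constraint mechanism described above, the following sketch fits a very simple degree-2 polynomial sublevel set to a dataset of state-action vectors, turns it into a binary OOD cost, and checks a trajectory against both an OOD budget and a safety budget. It is a minimal sketch under stated assumptions, not the construction used by $\mathsf{OGSRL}$: the axis-aligned quadratic polynomial, the empirical-quantile threshold, and every name in it (`fit_guardian`, `ood_cost`, `satisfies_dual_constraints`, the budgets, and the example `safety_cost`) are hypothetical choices made for illustration.

```python
import math


def fit_guardian(points, coverage=0.95):
    """Fit p(z) = sum_i ((z_i - mu_i) / sd_i)^2 and a threshold tau such that
    at least a `coverage` fraction of the dataset satisfies p(z) <= tau.

    Assumption: this axis-aligned quadratic is a stand-in for a general
    polynomial sublevel set; it is not the paper's fitting procedure.
    """
    n, d = len(points), len(points[0])
    mu = [sum(p[i] for p in points) / n for i in range(d)]
    # Fall back to 1.0 when a coordinate has zero spread, to avoid dividing by zero.
    sd = [math.sqrt(sum((p[i] - mu[i]) ** 2 for p in points) / n) or 1.0
          for i in range(d)]

    def poly(z):
        return sum(((z[i] - mu[i]) / sd[i]) ** 2 for i in range(d))

    scores = sorted(poly(p) for p in points)
    k = min(n - 1, max(0, math.ceil(coverage * n) - 1))
    return poly, scores[k]


def ood_cost(poly, tau, z):
    """Binary OOD cost: 0 inside the sublevel set {z : p(z) <= tau}, 1 outside."""
    return 0.0 if poly(z) <= tau else 1.0


def discounted_sum(costs, gamma):
    """Discounted cumulative cost along one trajectory."""
    return sum(c * gamma ** t for t, c in enumerate(costs))


def satisfies_dual_constraints(traj, poly, tau, safety_cost, gamma,
                               ood_budget, safety_budget):
    """Check both constraints on a single trajectory of state-action vectors."""
    ood = discounted_sum([ood_cost(poly, tau, z) for z in traj], gamma)
    safe = discounted_sum([safety_cost(z) for z in traj], gamma)
    return ood <= ood_budget and safe <= safety_budget


if __name__ == "__main__":
    # Toy 2-D "state-action" data; purely synthetic, not clinical data.
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    poly, tau = fit_guardian(data, coverage=1.0)

    def toy_safety_cost(z):
        # Hypothetical safety rule: penalize a large second coordinate.
        return 1.0 if z[1] > 0.9 else 0.0

    inside = [(0.5, 0.5), (1.0, 1.0)]
    outside = [(5.0, 5.0), (0.5, 0.5)]
    print(satisfies_dual_constraints(inside, poly, tau, toy_safety_cost, 0.9, 0.0, 1.0))
    print(satisfies_dual_constraints(outside, poly, tau, toy_safety_cost, 0.9, 0.0, 1.0))
```

In a constrained policy optimization setting, the two discounted sums would typically be estimated in expectation under a learned model rather than checked on a single trajectory; the per-trajectory check here only shows how the two costs are kept separate.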
