Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies

Runze Yan, Xun Shen, Akifumi Wachi, Sebastien Gros, Anni Zhao, Xiao Hu

TL;DR

This paper tackles the risk of unsafe generalization in offline RL for medical treatment by proposing Offline Guarded Safe Reinforcement Learning (OGSRL), a model-based offline framework with dual constraints: an OOD guardian that confines learning to clinically supported state-action regions and a safety cost constraint ensuring physiological safety. The guardian uses a polynomial sublevel set (PSoS) to approximate the in-distribution region with provable high-probability containment, while the policy optimization operates on an Estimated CMDP (E-CMDP) augmented with an OOD cost bound. The authors provide finite-sample safety and near-optimality guarantees that connect dataset size, model errors, and planning horizon to performance bounds, and validate the approach on real sepsis data, showing guardian-enhanced policies achieve superior clinician alignment, safety, and outcomes (notably, GMB-CPO achieves ME = 0.0138, a substantial improvement over SOC). The work advances safe offline medical RL by integrating distributional safeguards with domain-specific safety constraints, illustrating practical potential for decision-support systems that improve patient outcomes without extrapolating beyond observed data.
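The guardian idea can be illustrated with a minimal sketch. It is not the paper's PSoS construction: it assumes a simple diagonal quadratic (a degree-2 polynomial) as the sublevel-set function and picks the level as an empirical quantile, so it carries none of the containment guarantees stated in the paper. The names `fit_quadratic_sublevel_set` and `ood_cost`, the parameter `alpha`, and the 0/1 cost are hypothetical choices made for illustration.

```python
import math


def fit_quadratic_sublevel_set(points, alpha=0.1):
    """Fit a set {x : q(x) <= theta} that covers about (1 - alpha) of the data.

    ASSUMPTION: q is a diagonal quadratic centred on the data mean. This is a
    stand-in for a polynomial sublevel set, not the method from the paper.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    var = [sum((p[i] - mean[i]) ** 2 for p in points) / n for i in range(dim)]
    # Guard against zero variance so the division below is always defined.
    scale = [v if v > 0 else 1.0 for v in var]

    def q(x):
        return sum((x[i] - mean[i]) ** 2 / scale[i] for i in range(dim))

    # Level theta = empirical (1 - alpha) quantile of the scores on the data.
    scores = sorted(q(p) for p in points)
    k = min(n - 1, max(0, math.ceil((1 - alpha) * n) - 1))
    theta = scores[k]
    return q, theta


def ood_cost(q, theta, x):
    """Hypothetical OOD cost: 0 inside the sublevel set, 1 outside it."""
    return 0.0 if q(x) <= theta else 1.0


if __name__ == "__main__":
    # Toy "state-action" samples on a 10 x 10 grid (purely synthetic data).
    data = [(i % 10, i // 10) for i in range(100)]
    q, theta = fit_quadratic_sublevel_set(data, alpha=0.1)
    print(ood_cost(q, theta, (4.5, 4.5)))      # point at the data centre
    print(ood_cost(q, theta, (100.0, 100.0)))  # point far from the data
```

In a constrained policy search, a cost of this kind would be accumulated along model rollouts and bounded alongside the safety cost; how the paper defines and bounds its OOD cost is not reproduced here.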

Abstract

When offline reinforcement learning (RL) is applied in healthcare, out-of-distribution (OOD) issues pose significant risks, because inappropriate generalization beyond clinical expertise can produce potentially harmful recommendations. Existing methods such as conservative Q-learning (CQL) attempt to address the OOD issue, but their effectiveness is limited because they constrain only action selection, by suppressing uncertain actions. This action-only regularization imitates clinician actions that prioritize short-term rewards, but it fails to regulate downstream state trajectories, thereby limiting the discovery of improved long-term treatment strategies. To safely improve the policy beyond clinician recommendations while ensuring that state-action trajectories remain in-distribution, we propose \textit{Offline Guarded Safe Reinforcement Learning} ($\mathsf{OGSRL}$), a theoretically grounded model-based offline RL framework. $\mathsf{OGSRL}$ introduces a novel dual-constraint mechanism for improving the policy with reliability and safety. First, an OOD guardian is established to specify clinically validated regions for safe policy exploration. By constraining optimization within these regions, it enables the reliable exploration of treatment strategies that outperform clinician behavior by leveraging the full patient state history, without drifting into unsupported state-action trajectories. Second, we introduce a safety cost constraint that encodes medical knowledge about physiological safety boundaries, providing domain-specific safeguards even in areas where the training data might contain potentially unsafe interventions. Notably, we provide theoretical guarantees on safety and near-optimality: policies that satisfy these constraints remain in safe and reliable regions and achieve performance close to the best possible policy supported by the data.

Paper Structure

This paper contains 40 sections, 12 theorems, 63 equations, 13 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

For any probability level $\alpha > 0$ and any $\alpha_{\mathsf{c}} > \alpha$, there exists a polynomial degree $d$ such that the following holds: $\mathsf{Pr}\left( \widehat{\mathcal{U}}_{\hat{\theta}^N_{\alpha_{\mathsf{c}}}, d} \not\subset \mathcal{U}_{\mathsf{id}} \right) \leq \exp\left( -2 N^2 (

Figures (13)

  • Figure 1: Results on state distributions by learned policies via different algorithms. Blue points represent the original offline dataset; orange points represent the states visited by the learned policies.
  • Figure 2: Comparison of cumulative reward distributions between the SOC (green) and policies by different algorithms with guard mechanisms (blue). Each subplot shows the estimated reward density for trajectories in the test set. Dashed vertical lines indicate the mean rewards. (a) $\mathsf{CQL}$ vs. $\mathsf{GCQL}$; (b) $\mathsf{CCQL}$ vs. $\mathsf{GCCQL}$; (c) $\mathsf{MB\text{-}TRPO}$ vs. $\mathsf{GMB\text{-}TRPO}$; (d) $\mathsf{MB\text{-}CPO}$ vs. $\mathsf{GMB\text{-}CPO}$.
  • Figure 3: Physiological safety assessment of learned policies. We evaluate the safety of learned treatment policies by analyzing two critical physiological states: SpO$_{2}$ and urine output. Our assessment compares the percentage of states falling below defined safety thresholds against SOC. Positive values represent a reduction in unsafe states compared to SOC, while negative values indicate an increase.
  • Figure 4: Physiological state progression under model-based reinforcement learning methods ($\mathsf{MB\text{-}TRPO}$, $\mathsf{GMB\text{-}TRPO}$, $\mathsf{MB\text{-}CPO}$, and $\mathsf{GMB\text{-}CPO}$). Each step is represented by a box plot, where each box shows the interquartile range (25th-75th percentiles) with the horizontal line indicating the median. Whiskers extend to 1.5$\times$IQR, and black dots represent outliers, i.e., individual measurements falling outside this range.
  • Figure 5: Physiological state progression under model-free reinforcement learning methods ($\mathsf{CQL}$, $\mathsf{GCQL}$, $\mathsf{CCQL}$, and $\mathsf{GCCQL}$). Each step is represented by a box plot, where each box shows the interquartile range (25th-75th percentiles) with the horizontal line indicating the median. Whiskers extend to 1.5$\times$IQR, and black dots represent outliers, i.e., individual measurements falling outside this range.
  • ...and 8 more figures
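
The quantities described in the captions of Figures 3 to 5 can be sketched as follows. This is an illustrative reading of the captions, not the authors' evaluation code. It assumes that the Figure 3 metric is a percentage-point difference against SOC and that quartiles use linear interpolation; the function names and the threshold value in the demo are hypothetical.

```python
def percentile(sorted_vals, p):
    """Percentile p (0 to 100) of an already sorted list, linear interpolation."""
    if not sorted_vals:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def iqr_outliers(values):
    """Values lying more than 1.5 x IQR outside the 25th-75th percentile box."""
    s = sorted(values)
    q1, q3 = percentile(s, 25), percentile(s, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def unsafe_reduction(policy_values, soc_values, threshold):
    """Percentage-point drop in below-threshold states relative to SOC.

    ASSUMPTION: positive means fewer unsafe states than SOC, matching the
    sign convention in the Figure 3 caption; the exact formula is a guess.
    """
    def pct_below(vals):
        return 100.0 * sum(v < threshold for v in vals) / len(vals)

    return pct_below(soc_values) - pct_below(policy_values)


if __name__ == "__main__":
    # Synthetic numbers only; the threshold 92 is a hypothetical example.
    print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
    print(unsafe_reduction([95, 96, 91, 97], [90, 91, 95, 96], threshold=92))
```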

Theorems & Definitions (21)

  • Definition 1: Polynomial sublevel set
  • Theorem 1
  • Definition 2
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 11 more