Decision-Point Guided Safe Policy Improvement

Abhishek Sharma, Leo Benac, Sonali Parbhoo, Finale Doshi-Velez

TL;DR

Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs considered for improvement, ensures high-confidence improvement in densely visited states (i.e. decision points) while still utilizing data from sparsely visited states.

Abstract

Within batch reinforcement learning, safe policy improvement (SPI) seeks to ensure that the learnt policy performs at least as well as the behavior policy that generated the dataset. The core challenge in SPI is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (i.e. decision points) while still utilizing data from sparsely visited states. By appropriately limiting where and how we may deviate from the behavior policy, we achieve tighter bounds than prior work; specifically, our data-dependent bounds do not scale with the size of the state and action spaces. In addition to the analysis, we demonstrate that DPRL is both safe and performant on synthetic and real datasets.
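
The count-based restriction described above can be illustrated with a small tabular sketch. This is not the authors' implementation: the function names, the toy dataset, the value estimates, and the specific rule used here (deviate only to a densely observed action whose estimated value exceeds the estimated behavior value, and otherwise defer to the behavior policy) are assumptions made for illustration.

```python
from collections import Counter

def count_pairs(dataset):
    """Count visits n(s, a) over (state, action, reward, next_state) tuples."""
    return Counter((s, a) for s, a, _reward, _next_state in dataset)

def num_dense_pairs(counts, n_min):
    """Number of (s, a) pairs observed at least n_min times."""
    return sum(1 for n in counts.values() if n >= n_min)

def restricted_improvement(counts, q_hat, v_b_hat, n_min):
    """Hypothetical improvement rule (an assumption, not the paper's exact one).

    For each state, pick the best action among those seen at least n_min
    times whose estimated value beats the behavior's estimated value.
    None means "defer to the behavior policy" in that state.
    """
    policy = {state: None for state in v_b_hat}
    best_q = {}
    for (s, a), n in counts.items():
        if n < n_min:
            continue  # sparsely visited pair: never a candidate for deviation
        q = q_hat[(s, a)]
        if q > v_b_hat[s] and q > best_q.get(s, float("-inf")):
            best_q[s] = q
            policy[s] = a
    return policy

# Toy batch: (state, action, reward, next_state). All values are made up.
dataset = (
    [("s0", "left", 0.0, "s1")] * 5
    + [("s0", "right", 1.0, "s2")] * 3
    + [("s1", "left", 0.0, "s0")] * 1
)
counts = count_pairs(dataset)

# Assumed value estimates; in practice these would come from the batch.
q_hat = {("s0", "left"): 0.0, ("s0", "right"): 1.0, ("s1", "left"): 0.0}
v_b_hat = {"s0": 0.375, "s1": 0.0}

dense = num_dense_pairs(counts, n_min=3)
policy = restricted_improvement(counts, q_hat, v_b_hat, n_min=3)
```

In this toy batch, only state `s0` has densely observed actions, so it is the only state where the sketch can deviate from the behavior; `s1` falls back to the behavior policy because its single observed pair is below the threshold.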

Paper Structure

This paper contains 34 sections, 6 theorems, 27 equations, 7 figures, 4 tables, 3 algorithms.

Key Result

Theorem 1

Let $\pi_\text{DP}$ be the policy obtained by the DP algorithm. Then $\pi_\text{DP}$ is a safe policy improvement over the behavior policy $\pi_\text{b}$, with probability at least $1-\delta$ where $C(N_{\wedge})$ is the count of the number of $(s,a)$ pairs that are observed at least $N_{\wedge}$ times in the dataset:

Figures (7)

  • Figure 1: Top: MDPs that are challenging for prior approaches. In the first MDP, the behavior goes left with high probability. In the second MDP, the behavior goes to state $b_2$ with high probability. Green states are part of optimal trajectories, and red states are part of risky trajectories. Bottom L to R: PQI does poorly on the first MDP, DP has the tightest safety bounds, and CQL does poorly on the second MDP.
  • Figure 2: Gridworld: (left) Illustration of our Gridworld environment, (middle) Bias-variance trade-off managed by $N_{\wedge}$, (right) Performance of DPRL in terms of CVaR and Mean Value. DPRL provides safe policy improvement (CVaR), while matching baselines on mean value.
  • Figure 3: Gridworld: Replacing true behavior with estimated behavior in SPIBB leads to degraded CVaR. DP performs better than SPIBB without access to the true behavior.
  • Figure 4: (Left) OPE value estimates of the learned policies on the MIMIC dataset: DPRL achieves the highest estimated value. (Right) The reason for DPRL's strong performance: for the chosen hyperparameters ($N_{\wedge} = 50$, $r=10$), DPRL defers to the behavior in nearly all states except where it is confident it can achieve a better outcome.
  • Figure 5: DP consistently learns good policies from suboptimal behavior data across Atari environments. Each algorithm is trained on 100,000 samples and evaluated on 20 episodes after training.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Theorem 1: DPRL Discrete
  • Theorem 2: DPRL Continuous
  • Lemma 3: Performance Difference Lemma
  • Lemma 4: McDiarmid's Inequality
  • Theorem 4: DPRL Discrete
  • Theorem 4: DPRL Continuous