Selective Uncertainty Propagation in Offline RL

Sanath Kumar Krishnamurthy, Tanmay Gangwani, Sumeet Katariya, Branislav Kveton, Shrey Modi, Anshuka Rangi

TL;DR

The paper addresses policy evaluation in finite-horizon offline RL, where actions change future state distributions and the resulting distribution shift complicates confidence interval (CI) construction. It introduces selective uncertainty propagation, which combines offline contextual-bandit (CB) estimators, optimistic and pessimistic RL value estimates, and shift-model information to build tight, instance-adaptive CIs for the step-$h$ treatment effect ${\alpha}^{(h)}_{\pi}$. The main theoretical result is a high-probability bound on the estimation error that adapts to the estimated hardness of the instance through the quality of these inputs, giving CB-like rates when shifts are small and RL-like guarantees when they are large. The paper also modifies pessimistic value iteration into SPVI, which maximizes a selective lower bound, and reports experiments on ChainBandit and GridWorld showing improved CI quality and offline policy learning, particularly in less dynamic (CB-like) settings.

Abstract

We consider the finite-horizon offline reinforcement learning (RL) setting, and are motivated by the challenge of learning the policy at any step h in dynamic programming (DP) algorithms. To learn this, it is sufficient to evaluate the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. Since the policy at any step can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically hard than estimating such treatment effects in the stochastic contextual bandit setting. However, the hardness of many real-world RL instances lies between the two regimes. We develop a flexible and general method called selective uncertainty propagation for confidence interval construction that adapts to the hardness of the associated distribution shift challenges. We show benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning.

Paper Structure

This paper contains 12 sections, 3 theorems, 29 equations, 5 figures, and 1 algorithm.

Key Result

Theorem 4.1

Suppose we have: (1) CB inputs $(\hat{\theta}^{(h)}_{\pi},\kappa^{(h)}_{\pi,\theta})$; (2) RL inputs $(\hat{V}^{(h+1)}_{\pi,p},\hat{V}^{(h+1)}_{\pi},\hat{V}^{(h+1)}_{\pi,o})$ satisfying eq:ordering; and (3) shift inputs $(\hat{\Delta}^{(h)},\kappa^{(h)}_{\pi,\Delta})$ satisfying eq:s. Now, for some fixed $\delta>0$, with probability at least $1-\delta-\delta_{\text{in}}$, we have the …

Figures (5)

  • Figure 1: ChainBandit MDP with a horizon/length of 4 (an adjustable environment parameter). The environment has two chains, a top chain and a bottom chain, and three actions $a_1,a_2,a_3$. The top-chain states are the most rewarding. The agent starts at (1,0). At any state in the bottom chain, every action leads to the same transition (moving to the next state in the bottom chain), so these are essentially bandit states. In the top chain, $a_1$ and $a_2$ both move the agent to the next state in the top chain, while $a_3$ moves it to the next state in the bottom chain. In the top chain, the highest cumulative reward comes from never taking $a_3$, yet the highest immediate reward comes from selecting $a_3$, which makes planning beneficial in this environment. Note that $a_3$ is a sub-optimal action at every state.
  • Figure 2: We plot CIs for ${\alpha}^{(2)}_{\pi}$ while varying the evaluation policy. The evaluation policies are parameterized by $\lambda\in[0,1]$: at every state and step, the probabilities of selecting $a_1$, $a_2$, and $a_3$ are $(1-\lambda)/2$, $(1-\lambda)/2$, and $\lambda$, respectively. The evaluation policy coincides with the behavioral policy at $\lambda=0.8$. The number of training episodes is 10000, and the plots are averaged over 10 runs.
  • Figure 3: Policy learning with a bad behavioral policy
  • Figure 4: Plot of ${\alpha}^{(2)}_{\pi}$ as the evaluation policy $\pi$ varies. The X-axis shows $\lambda$, the evaluation policy's probability of taking the down action.
  • Figure 5: Learning Experiment on GridWorld
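
The transition structure in the Figure 1 caption and the $\lambda$-parameterized evaluation policy in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the state encoding `(chain, position)` with `1` meaning the top chain, the function names, and the reading of "horizon 4" as four transitions are all hypothetical choices. Rewards are omitted because the captions do not give their values.

```python
import random

# Assumed encoding: state = (chain, position), chain 1 = top, chain 0 = bottom.
TOP, BOTTOM = 1, 0
ACTIONS = ("a1", "a2", "a3")


def evaluation_policy(lam):
    """Return probabilities of (a1, a2, a3) = ((1-lam)/2, (1-lam)/2, lam)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return ((1.0 - lam) / 2.0, (1.0 - lam) / 2.0, lam)


def step(state, action):
    """Apply one ChainBandit transition as described in the Figure 1 caption."""
    chain, position = state
    # In the top chain, a3 drops the agent to the bottom chain;
    # a1 and a2 keep it in the top chain.
    # In the bottom chain, every action has the same effect.
    if chain == TOP and action == "a3":
        chain = BOTTOM
    return (chain, position + 1)


def rollout(lam, steps=4, seed=0):
    """Sample one trajectory of states under the lam-parameterized policy."""
    rng = random.Random(seed)
    probs = evaluation_policy(lam)
    state = (TOP, 0)  # the caption's start state (1,0), under the assumed encoding
    trajectory = [state]
    for _ in range(steps):
        action = rng.choices(ACTIONS, weights=probs)[0]
        state = step(state, action)
        trajectory.append(state)
    return trajectory
```

With `lam=0.0` the agent never takes $a_3$ and stays in the top chain, while with `lam=1.0` it drops to the bottom chain on the first step. Intermediate values of `lam` control how often the policy shifts the next-state distribution away from the top chain.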

Theorems & Definitions (4)

  • Theorem 4.1
  • Theorem A.1
  • Lemma A.1
  • proof