Bellman-consistent Pessimism for Offline Reinforcement Learning

Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal

TL;DR

The notion of Bellman-consistent pessimism for general function approximation is introduced: instead of calculating a point-wise lower bound for the value function, pessimism is implemented at the initial state over the set of functions consistent with the Bellman equations.

Abstract

The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees require only Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
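The idea of taking pessimism "at the initial state over the set of functions consistent with the Bellman equations" can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a tabular problem, a small finite function class `F` of Q-tables, a finite policy class `Pi` of deterministic policies, and a squared-loss estimate of Bellman inconsistency, none of which are spelled out in the abstract. All names (`sq_loss`, `bellman_error`, `pessimistic_value`, `select_policy`) and the threshold `eps` are hypothetical.

```python
import numpy as np

def sq_loss(g, f, pi, data, gamma):
    """Mean squared error of g against the one-step backup of f under pi."""
    s, a, r, s2 = data
    target = r + gamma * f[s2, pi[s2]]
    return np.mean((g[s, a] - target) ** 2)

def bellman_error(f, pi, F, data, gamma):
    """Excess loss of f over the best g in F (assumed proxy for inconsistency)."""
    best = min(sq_loss(g, f, pi, data, gamma) for g in F)
    return sq_loss(f, f, pi, data, gamma) - best

def pessimistic_value(pi, F, data, gamma, eps, s0):
    """Smallest initial-state value among functions nearly consistent for pi."""
    vals = [f[s0, pi[s0]] for f in F
            if bellman_error(f, pi, F, data, gamma) <= eps]
    # Choice of this sketch: an empty consistent set scores -inf.
    return min(vals) if vals else -np.inf

def select_policy(Pi, F, data, gamma, eps, s0):
    """Pick the policy whose pessimistic initial-state value is largest."""
    scores = [pessimistic_value(pi, F, data, gamma, eps, s0) for pi in Pi]
    return int(np.argmax(scores)), scores

# Toy problem: one state, two actions, gamma = 0; the data covers action 0 only.
data = (np.array([0, 0, 0]), np.array([0, 0, 0]),
        np.array([0.5, 0.5, 0.5]), np.array([0, 0, 0]))
F = [np.array([[0.5, 0.0]]), np.array([[0.5, 1.0]])]  # disagree off-support
Pi = [np.array([0]), np.array([1])]
best, scores = select_policy(Pi, F, data, gamma=0.0, eps=1e-8, s0=0)
```

In this toy problem both candidate functions fit the data equally well, so neither is ruled out. The policy that picks the unobserved action is scored by the lower of the two disagreeing predictions, while the policy that stays on the observed action keeps its data-supported value. No per-state bonus is computed; pessimism enters only through the minimum at `s0`.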

Paper Structure

This paper contains 41 sections, 29 theorems, 141 equations, 1 figure.

Key Result

Theorem 3.1

Let $\varepsilon = \varepsilon_r$, where $\varepsilon_r$ is defined in Eq. eq:def_varepsilon_r, and let ${\widehat{\pi}}$ be obtained by Eq. eq:infotheosol. Then, for any policy $\pi \in \Pi$ and any constant $C_2 \ge 1$, with probability at least $1-\delta$, where $\mathscr C(\nu;\mu,\mathcal{F},\pi)$ is defined in Definition def:concenbddist, $(d_{\pi}\setminus\nu)(s,a) \coloneqq \max(d_\pi(s,a) - \nu(

Figures (1)

  • Figure 1: An example illustrating different on-support and off-support splittings (denoted by two different vertical lines). Each splitting corresponds to a different $C_2$ value and therefore yields a different bias-variance trade-off.

Theorems & Definitions (53)

  • Definition 1
  • Theorem 3.1
  • Corollary 1: "Double Robustness"
  • Corollary 2: Competing with optimal policy
  • Corollary 3: Bounded degradation from behavior policy
  • Proof: Proof sketch of Theorem \ref{thm:infothebd2}
  • Definition 2: Linear Function Approximation
  • Theorem 3.2
  • Theorem 4.1
  • Corollary 4: "Double Robustness"
  • ...and 43 more