The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation

Noah Golowich, Ankur Moitra

TL;DR

The paper studies offline reinforcement learning with linear function approximation under low inherent Bellman error $\varepsilon_{\mathrm{BE}}$. It proves that, under a single-policy coverage condition, there is a computationally efficient offline algorithm whose suboptimality scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$, plus a statistical error that decays with the dataset size $n$; moreover, this $\sqrt{\varepsilon_{\mathrm{BE}}}$ rate is optimal in a minimax sense. A key technical contribution is the introduction of perturbed linear policies, together with a Gaussian-smoothing-based analysis that yields approximate Bellman restricted closedness. This enables an offline actor-critic method whose actor is a no-regret Follow-the-Perturbed-Leader (FTPL) procedure. The paper also provides a matching lower bound showing that the effect of misspecification in offline RL with inherent Bellman error cannot, in general, be reduced below $\sqrt{\varepsilon_{\mathrm{BE}}}$, underscoring a fundamental separation from online RL. Together, these results establish sharp, end-to-end, computationally efficient guarantees for offline RL with linear structure under low inherent Bellman error, and quantify the cost of misspecification in batch settings.
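To make the idea of a perturbed linear policy concrete, here is a minimal Python sketch. It assumes the policy acts greedily with respect to a linear score whose weight vector is perturbed by isotropic Gaussian noise of scale `sigma`; the paper's Definition 2.1 may differ in its details, and the function names, feature matrix, and weights below are hypothetical illustrations rather than anything taken from the paper.

```python
import numpy as np

def perturbed_linear_action(phi_sa, w, sigma, rng):
    """Sample one action from a Gaussian-perturbed linear policy (illustrative).

    phi_sa : (num_actions, d) array of features phi(s, a) for one fixed state s
    w      : (d,) weight vector
    sigma  : perturbation scale; sigma = 0 recovers the greedy linear policy
    """
    # Assumption: the perturbation is added to the weight vector, FTPL-style.
    theta = w + sigma * rng.standard_normal(w.shape[0])
    return int(np.argmax(phi_sa @ theta))

def action_distribution(phi_sa, w, sigma, rng, num_samples=10_000):
    """Monte Carlo estimate of the action probabilities induced by the noise."""
    noise = rng.standard_normal((num_samples, w.shape[0]))
    scores = (w + sigma * noise) @ phi_sa.T  # shape (num_samples, num_actions)
    counts = np.bincount(np.argmax(scores, axis=1), minlength=phi_sa.shape[0])
    return counts / num_samples

rng = np.random.default_rng(0)
phi_sa = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # hypothetical features
w = np.array([1.0, 0.9])                                 # hypothetical weights

greedy = perturbed_linear_action(phi_sa, w, 0.0, rng)  # no noise: plain argmax
probs = action_distribution(phi_sa, w, 0.5, rng)       # smoothed action law
```

With `sigma = 0` the policy is a deterministic argmax, which changes discontinuously in `w`; with `sigma > 0` the induced action distribution varies smoothly with `w`, which is the kind of smoothing effect the Gaussian-smoothing analysis relies on.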

Abstract

In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely which outputs a policy whose value is at least that of any policy which is well-covered by the dataset. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
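For reference, one standard way to formalize the inherent Bellman error in a horizon-$H$ MDP with feature maps $\phi_h$ is sketched below. The notation ($\phi_h$, parameter sets $\mathcal{B}_h$, rewards $r_h$, transitions $P_h$) is assumed here for illustration and is not taken from this page; the paper's formal definition may differ in its details.

$$
\varepsilon_{\mathrm{BE}} \;=\; \max_{h \in [H]} \; \sup_{\theta_{h+1} \in \mathcal{B}_{h+1}} \; \inf_{\theta_h \in \mathcal{B}_h} \; \sup_{(x,a)} \left| \langle \phi_h(x,a), \theta_h \rangle - \Big( r_h(x,a) + \mathbb{E}_{x' \sim P_h(\cdot \mid x,a)} \max_{a'} \langle \phi_{h+1}(x',a'), \theta_{h+1} \rangle \Big) \right|
$$

Under this reading, $\varepsilon_{\mathrm{BE}} = 0$ means that the Bellman backup of every linear value function, taken with respect to its greedy policy, is again exactly linear in the features, which is the linear Bellman completeness case mentioned in the abstract.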

Paper Structure (48 sections, 24 theorems, 109 equations, 4 algorithms)

Key Result

Theorem 1.1

There is an algorithm (namely, alg:actor) which, given the dataset $\mathcal{D}$ as input, outputs a random policy $\hat{\pi}$ so that for any policy $\pi^\star$, we have [equation not recovered]. Moreover, alg:actor runs in time $\mathrm{poly}(d,H,n)$.

Theorems & Definitions (48)

  • Theorem 1.1: Informal version of \ref{thm:pacle-ftpl}
  • Theorem 1.2: Informal version of \ref{thm:lb-formal}
  • Remark 1.1: Confluence of terminology
  • Theorem 1.3: $\Pi^{\mathsf{Plin}, \sigma}$-Bellman restricted closedness; informal version of \ref{cor:ibe-linear}
  • Definition 2.1: Perturbed linear policies
  • Definition 2.2: Coverage parameter
  • Lemma 3.1
  • Corollary 3.2
  • proof
  • Corollary 3.3
  • ...and 38 more