A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning

Andrew Patterson, Adam White, Martha White

TL;DR

This work introduces a generalized projected Bellman error (PBE) for off-policy value estimation with nonlinear function approximation, unifying the traditional Bellman error (BE) and PBE frameworks and providing bounds on value error. By using a projection-based formulation and a flexible weighting scheme, the authors address identifiability and stability concerns, and show how emphatic weightings can improve solution quality. They derive gradient-based algorithms (TDRC for prediction and QRC for control) that are practical and robust with neural network function approximation, demonstrated on four control benchmarks. The paper also analyzes the impact of the projection set and state weighting on solution quality, and outlines open questions about combining gradient-correction methods with saddlepoint approaches and optimizing weightings in control.

Abstract

Many reinforcement learning algorithms rely on value estimation; however, the most widely used algorithms -- namely temporal difference algorithms -- can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean squared projected Bellman error (MSPBE) and are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective -- the mean squared Bellman error (MSBE) -- which naturally facilitate nonlinear approximation. In this work, we build on these insights and introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We show how this generalized objective unifies previous work and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective, and show that it is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.
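
To make the kind of gradient-correction algorithm mentioned above more concrete, here is a minimal sketch of a TDRC-style update with linear features. The abstract does not state the update rule, so the form below is an assumption based on how TD with regularized corrections is commonly written: a primary weight vector `w`, a secondary vector `h` that estimates the expected TD error, and an L2 penalty `beta` on `h`. The function name `tdrc_step` and all parameter names are hypothetical.

```python
import numpy as np

def tdrc_step(w, h, x, r, x_next, gamma, rho, alpha, beta=1.0):
    """One TDRC-style update with linear features (assumed form, not taken from the source).

    w      : primary weights, value estimate is w @ x
    h      : secondary weights, h @ x estimates the expected TD error
    rho    : importance sampling ratio pi(a|s) / b(a|s)
    beta   : strength of the L2 regularizer on h
    """
    # TD error under the current primary weights
    delta = r + gamma * (w @ x_next) - (w @ x)
    h_dot_x = h @ x

    # TD update plus a gradient correction term that uses the secondary estimate
    w_new = w + alpha * rho * (delta * x - gamma * h_dot_x * x_next)

    # Secondary weights regress toward the TD error, shrunk by the regularizer
    h_new = h + alpha * ((rho * delta - h_dot_x) * x - beta * h)
    return w_new, h_new
```

Under this assumed form, when `h` is zero the correction term vanishes and the primary update reduces to an importance-weighted TD(0) step, which is one way to read the "easy-to-use" claim: the method behaves like TD unless the secondary estimate indicates a correction is needed.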

Paper Structure

This paper contains 48 sections, 7 theorems, 113 equations, 17 figures, 1 table.

Key Result

Theorem 4

Assume $c_{\mathcal{F},d} < 1$. Let Then

Figures (17)

  • Figure 1: The visualization above characterizes the true $v_\pi$, the $\overline{\text{PBE}}$ solution, and how projections operate on successive approximations. Assume the estimate of $v_\pi$ starts from ${\boldsymbol{v}}$ in red. The Bellman operator pushes the value estimate out of the space of representable functions, which is represented by the plane (note this corresponds to $\mathcal{F} = \mathcal{H}$ introduced in Section \ref{sec:identifiable-be}). The projection brings the approximation back down to the nearest representable function on the plane. This process is repeated until the value estimates converge to the blue dot at the base of the black line. Subsequent updates push the approximation of $v_\pi$ out of the space of representable functions and the projection brings it back onto the plane. The true value in this case is outside of $\mathcal{F}$, with the $\overline{\text{VE}}$ being the distance between the ${\boldsymbol{v}}$ at $\overline{\text{PBE}}=0$ and $v_\pi$. Note the projection of $v_\pi$ onto $\mathcal{F}$ need not be equal to the $\overline{\text{PBE}}$ solution.
  • Figure 2: A comparison of the $\overline{\text{BE}}$ and $\overline{\text{PBE}}$ solutions when the true value function is not representable. As before, we visualize how the approximation that minimizes the $\overline{\text{PBE}}$ at convergence can be far from $v_\pi$ with a large projection penalty. The approximate value function that minimizes the $\overline{\text{BE}}$ on the other hand is closer to $v_\pi$ and typically has a smaller projection penalty (note the Bellman operator would indeed push ${\boldsymbol{v}}_{{\text{\tiny BE}}}$ outside $\mathcal{F}$).
  • Figure 3: A visual interpretation of how the Bellman operator can push the value estimates outside the space of representable functions and the role of the projection operator. The set $\mathcal{F}$ corresponds to the (parameterized) space of value functions and $\mathcal{H}$ is the set of functions that approximate (project) the Bellman error $\mathcal{T} v - v$. Potential settings include $\mathcal{F} = \mathcal{H}$ (visualized in Figure \ref{fig:pbe-sol}), $\mathcal{F} \subset \mathcal{H}$ visualized in (a) and $\mathcal{F} \neq \mathcal{H}$ visualized in (b). In (a), we highlight two cases: $\mathcal{T} v$ is not representable by any function in $\mathcal{F}$ or $\mathcal{H}$, or $\mathcal{T} v$ is representable by functions in $\mathcal{H}$ but not $\mathcal{F}$. In (b) we see examples of projections when $\mathcal{F}$ intersects $\mathcal{H}$.
  • Figure 4: The visualization above shows how the $\overline{\text{PBE}}$ solution can result in arbitrarily bad value error under some behaviours. The blue line above is the same as the visualization used in prior work to demonstrate issues with minimizing $\overline{\text{PBE}}$ (see kolter2011fixed for a description of the counterexample). The vertical axis measures $\overline{\text{VE}}$ and the horizontal axis different behavior policies. This figure differs from kolter2011fixed; we show that the $\overline{\text{BE}}$ solution exhibits low error and highlight the impact of changing $\mathcal{H}$. The size of the set $\mathcal{H}$ increases from the left subplot to the right. More behavior policies result in low generalized $\overline{\text{PBE}}$ as the set $\mathcal{H}$ increases.
  • Figure 5: Investigating the $\overline{\text{VE}}$ of the fixed points of $\overline{\text{PBE}}$ and $\overline{\text{BE}}$ under $d_b$, $d_\pi$, and $m$ on a 19-state random walk. All errors are computed in closed form given access to the reward and transition dynamics. The fixed point of the $\overline{\text{PBE}}$ with emphatic weighting consistently has the lowest error across several different state representations (light color), while the fixed point of the $\overline{\text{PBE}}$ under $d_b$ has the highest error (dark blue). Results are averaged over one million randomly generated policies and state representations.
  • ...and 12 more figures
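
The closed-form error computation described for Figure 5 can be sketched as follows for linear features. This is an illustrative sketch only: it uses the standard linear fixed-point equations $A w = b$ with $A = X^\top D (I - \gamma P) X$ and $b = X^\top D r$, and a small 5-state random walk (reward +1 on exiting right, 0 on exiting left) rather than the 19-state setup in the figure, whose exact rewards, policies, and features are not given here. All function names are hypothetical.

```python
import numpy as np

def random_walk(n):
    """Transition matrix and expected rewards for an n-state random walk.

    Terminal transitions are implicit: rows of P for the end states sum to 0.5.
    """
    P = np.zeros((n, n))
    r = np.zeros(n)
    for s in range(n):
        if s > 0:
            P[s, s - 1] = 0.5
        if s < n - 1:
            P[s, s + 1] = 0.5
    r[n - 1] = 0.5  # +1 reward with probability 0.5 when exiting on the right
    return P, r

def true_values(P, r, gamma):
    """Solve the Bellman equation v = r + gamma * P v exactly."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)

def pbe_fixed_point(X, P, r, d, gamma):
    """Weights w at which the linear projected Bellman error is zero under weighting d."""
    D = np.diag(d)
    A = X.T @ D @ (np.eye(len(r)) - gamma * P) @ X
    b = X.T @ D @ r
    return np.linalg.solve(A, b)

def value_error(v, v_pi, d):
    """Root of the d-weighted squared distance between v and the true values."""
    return float(np.sqrt(np.sum(d * (v - v_pi) ** 2)))

P, r = random_walk(5)
v_pi = true_values(P, r, gamma=1.0)
d = np.full(5, 0.2)  # uniform state weighting, chosen only for illustration

# Two aggregated features: states {0, 1, 2} share one weight, {3, 4} another
X_agg = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
w_agg = pbe_fixed_point(X_agg, P, r, d, gamma=1.0)
ve_agg = value_error(X_agg @ w_agg, v_pi, d)
```

Swapping `d` for a behavior distribution, the target policy's distribution, or an emphatic weighting changes the fixed point returned by `pbe_fixed_point`, which is the comparison the figure reports.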

Theorems & Definitions (10)

  • Definition 2: Discounted Transition Constant
  • Definition 3: Operator Constant
  • Theorem 4
  • Definition 5: State Weighting Mismatch
  • Corollary 6
  • Proposition 7
  • Corollary 8
  • Theorem 9
  • Corollary 10
  • Theorem 11