A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning

Andrew Patterson; Adam White; Martha White

TL;DR

This work introduces a generalized projected Bellman error (PBE) for off-policy value estimation with nonlinear function approximation, unifying the Bellman error (BE) and PBE objectives and providing bounds on value error. Using a projection-based formulation and a flexible state-weighting scheme, the authors address identifiability and stability concerns and show how emphatic weightings can improve solution quality. They derive gradient-based algorithms (TDRC for prediction and QRC for control) that are practical and robust with neural network function approximation, demonstrated on four control benchmarks. The paper also analyzes how the projection set and the state weighting affect solution quality, and outlines open questions about combining gradient-correction methods with saddle-point approaches and about optimizing weightings in control.
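To make the gradient-correction idea behind TDRC concrete, the following is a minimal sketch of one update step in the linear prediction setting, following the commonly published form of TD with regularized corrections. It is an illustration under stated assumptions, not the authors' implementation: the function names `dot` and `tdrc_update`, the argument names, and the default step size, discount, and regularization values are all hypothetical, and the nonlinear (neural network) variant discussed in the paper is not shown.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def tdrc_update(
    w: Vector,        # primary weights: value estimate is dot(w, x)
    h: Vector,        # secondary weights: estimate the expected TD error
    x: Vector,        # features of the current state
    r: float,         # observed reward
    x_next: Vector,   # features of the next state (all zeros if terminal)
    rho: float = 1.0,     # importance sampling ratio (1.0 when on-policy)
    gamma: float = 0.99,  # discount factor (assumed value)
    alpha: float = 0.1,   # step size (assumed value)
    beta: float = 1.0,    # regularization on h (assumed value)
) -> Tuple[Vector, Vector]:
    """One linear TDRC-style step; returns the updated (w, h)."""
    # TD error under the current value estimate.
    delta = r + gamma * dot(w, x_next) - dot(w, x)
    # Secondary estimate of the expected TD error at the current state.
    h_x = dot(h, x)

    # Primary update: TD step plus a gradient-correction term.
    new_w = [
        wi + alpha * rho * (delta * xi - gamma * h_x * xni)
        for wi, xi, xni in zip(w, x, x_next)
    ]
    # Secondary update: regress h toward the TD error, with L2 shrinkage.
    new_h = [
        hi + alpha * ((rho * delta - h_x) * xi - beta * hi)
        for hi, xi in zip(h, x)
    ]
    return new_w, new_h
```

Setting `beta` to zero recovers a TDC-style update in this sketch, while a large `beta` keeps `h` near zero so the primary update behaves more like plain TD.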

Abstract

Many reinforcement learning algorithms rely on value estimation; however, the most widely used algorithms, namely temporal difference algorithms, can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean squared projected Bellman error (MSPBE) and are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, the mean squared Bellman error (MSBE), which naturally facilitate nonlinear approximation. In this work, we build on these insights and introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We show how this generalized objective unifies previous work and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective, and show that it is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.
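For orientation, the two objectives named above can be written in standard notation. The symbols are assumptions not defined in this excerpt: $v_w$ is the parameterized value estimate, $T$ the Bellman operator of the target policy, $d$ a state weighting, $\Pi_d$ the $d$-weighted projection onto a linear function class containing $v_w$, and $\delta = R + \gamma v_w(S') - v_w(S)$ the TD error, so that $\mathbb{E}[\delta \mid S = s] = (T v_w)(s) - v_w(s)$. The last line is a sketch of how a projection set $\mathcal{H}$ can interpolate between the two, using the identity $y^2 = \max_h (2yh - h^2)$; the paper's exact definition may differ in detail.

```latex
\begin{align*}
\text{MSBE}(w)  &= \| T v_w - v_w \|_d^2
                 = \sum_s d(s)\, \big( \mathbb{E}[\delta \mid S = s] \big)^2, \\
\text{MSPBE}(w) &= \| \Pi_d ( T v_w - v_w ) \|_d^2, \\
\text{projected form:} \quad
                &\max_{h \in \mathcal{H}} \sum_s d(s)\,
                 \big( 2\, \mathbb{E}[\delta \mid S = s]\, h(s) - h(s)^2 \big).
\end{align*}
```

When $\mathcal{H}$ contains all functions of the state, the maximizer is $h(s) = \mathbb{E}[\delta \mid S = s]$ and the projected form equals the MSBE; when $\mathcal{H}$ is the linear span used for $v_w$, it equals the linear MSPBE.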
