Bellman Error Centering

Xingguo Chen, Yu Gong, Shangdong Yang, Wenhao Wang

Abstract

This paper revisits the recently proposed reward centering algorithms, simple reward centering (SRC) and value-based reward centering (VRC), and shows that SRC is indeed reward centering, whereas VRC is essentially Bellman error centering (BEC). Based on BEC, we derive the centered fixpoint for tabular value functions and the centered TD fixpoint for linear value function approximation. We design the on-policy CTD algorithm and the off-policy CTDC algorithm and prove that both converge. Finally, we experimentally validate the stability of the proposed algorithms. Bellman error centering also extends readily to other reinforcement learning algorithms.
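To make the SRC/VRC distinction concrete, the following is a minimal tabular sketch. The update rules are the commonly stated form of reward centering and are an assumption here, since this page does not reproduce them. The function names and the parameters `alpha`, `beta`, `eta`, and `gamma` are illustrative only. In the SRC step the offset `r_bar` tracks the running mean of the reward, while in the VRC step it is driven by the TD (Bellman) error, which is the sense in which VRC can be read as Bellman error centering.

```python
def td_error(v, r_bar, s, r, s_next, gamma):
    """Centered TD error: r - r_bar + gamma * V(s') - V(s)."""
    return r - r_bar + gamma * v[s_next] - v[s]


def src_step(v, r_bar, s, r, s_next, alpha=0.5, beta=0.1, gamma=0.9):
    """One step of TD with simple reward centering (assumed form).

    r_bar moves toward the observed reward, independent of the values.
    """
    delta = td_error(v, r_bar, s, r, s_next, gamma)
    v = list(v)                      # copy so the caller's table is untouched
    v[s] += alpha * delta            # usual TD value update
    r_bar += beta * (r - r_bar)      # offset tracks the mean reward
    return v, r_bar, delta


def vrc_step(v, r_bar, s, r, s_next, alpha=0.5, eta=0.2, gamma=0.9):
    """One step of TD with value-based reward centering (assumed form).

    r_bar moves along the TD error, so it tracks the mean Bellman error.
    """
    delta = td_error(v, r_bar, s, r, s_next, gamma)
    v = list(v)
    v[s] += alpha * delta
    r_bar += eta * alpha * delta     # offset driven by the TD error
    return v, r_bar, delta


if __name__ == "__main__":
    v0, r_bar0 = [1.0, 2.0], 0.5
    print(src_step(v0, r_bar0, s=0, r=1.0, s_next=1))
    print(vrc_step(v0, r_bar0, s=0, r=1.0, s_next=1))
```

Both steps apply the same value update and differ only in the signal that drives `r_bar`: the raw reward for SRC, the TD error for VRC.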

Paper Structure

This paper contains 18 sections, 3 theorems, 82 equations, 7 figures.

Key Result

Theorem 4.1

(Convergence of on-policy CTD). In the on-policy case, consider the iterations (omega) and (theta). Let the step-size sequences $\alpha_k$ and $\beta_k$, $k \geq 0$, satisfy $\alpha_k, \beta_k > 0$ for all $k$, $\sum_{k=0}^{\infty}\alpha_k = \sum_{k=0}^{\infty}\beta_k = \infty$, ...

Figures (7)

  • Figure 1: Learning curves of three evaluation environments.
  • Figure 2: Boyan chain.
  • Figure 3: Sensitivity of various algorithms to learning rates for the Boyan chain.
  • Figure 4: 2-state off-policy counterexample.
  • Figure 5: Sensitivity of various algorithms to learning rates for the 2-state counterexample.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Theorem 4.1
  • proof
  • Theorem 5.1
  • proof
  • Lemma 6.1
  • proof