Reward Centering

Abhishek Naik; Yi Wan; Manan Tomar; Richard S. Sutton

TL;DR

Discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting the rewards' empirical average. When a problem's rewards are shifted by a constant, standard methods perform much worse, whereas methods with reward centering are unaffected.

Abstract

We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering.
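The on-policy idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes tabular TD(0) on a hypothetical two-state deterministic chain, and the names `centered_td0`, `alpha`, and `beta` are made up for this example. The average reward is tracked with a simple running average, and the reward minus that average is used in the TD error. The second call shifts every reward by 100 to illustrate that the learned values are essentially unchanged.

```python
def centered_td0(rewards_by_state, next_state, gamma=0.9,
                 alpha=0.1, beta=0.01, steps=20000):
    """Tabular TD(0) with simple reward centering (illustrative sketch).

    rewards_by_state[s] is the reward for leaving state s and
    next_state[s] is the deterministic successor of s (toy assumption).
    """
    v = [0.0] * len(next_state)   # centered value estimates
    r_bar = 0.0                   # running estimate of the average reward
    s = 0
    for _ in range(steps):
        r, s_next = rewards_by_state[s], next_state[s]
        # TD error computed with the centered reward (r - r_bar)
        delta = (r - r_bar) + gamma * v[s_next] - v[s]
        v[s] += alpha * delta
        # Running average of the observed rewards
        r_bar += beta * (r - r_bar)
        s = s_next
    return v, r_bar


# Hypothetical chain: state 0 -> 1 with reward 1, state 1 -> 0 with reward 3.
v, r_bar = centered_td0([1.0, 3.0], [1, 0])

# Same chain with all rewards shifted by +100.
v_shift, r_bar_shift = centered_td0([101.0, 103.0], [1, 0])
```

In this toy chain the average-reward estimate settles near 2 (near 102 for the shifted variant), while the value estimates of the two runs agree closely, because the constant shift is absorbed by the average-reward estimate rather than by the values.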

Paper Structure (9 sections, 2 theorems, 18 equations, 15 figures, 2 tables, 3 algorithms)

Theory of Reward Centering
Simple Reward Centering
Value-based Reward Centering
Case Study: Q-learning with Reward Centering
Discussion, Limitations, and Future Work
Pseudocode
Theoretical Details
Experimental Details
Connections to Related Approaches

Key Result

Theorem 1

If the Markov chain induced by the stationary behavior policy is irreducible and a per-state-action step size is reduced appropriately, tabular Q-learning with value-based reward centering (the action-value update and the TD-error-based update of the average-reward estimate) converges almost surely: $Q_t$ and $\bar{R}_t$
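A minimal sketch of tabular Q-learning with value-based reward centering may help make the theorem's setting concrete. The update form used here is an assumed reading of the method: the TD error uses the centered reward, the action value moves by `alpha * delta`, and the average-reward estimate moves by `eta * alpha * delta`. The two-state MDP, the function name `centered_q_learning`, the constant step size, and all parameter values are hypothetical choices for illustration. In particular, the theorem assumes step sizes that are reduced over time, which this sketch does not do.

```python
import random


def centered_q_learning(steps=20000, gamma=0.8, alpha=0.5, eta=0.02, seed=0):
    """Tabular Q-learning with value-based reward centering (sketch).

    Toy MDP (assumption): two states; action 0 stays, action 1 switches.
    The only nonzero reward is 2 for staying in state 1.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    r_bar = 0.0                   # average-reward estimate
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)      # uniform random behavior policy (off-policy)
        s_next = s if a == 0 else 1 - s
        r = 2.0 if (s == 1 and a == 0) else 0.0
        # TD error with the centered reward
        delta = r - r_bar + gamma * max(q[s_next]) - q[s][a]
        q[s][a] += alpha * delta
        # Average-reward estimate updated from the same TD error
        r_bar += eta * alpha * delta
        s = s_next
    return q, r_bar


q, r_bar = centered_q_learning()
# Greedy action per state: expect "switch" in state 0 and "stay" in state 1.
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

Because both quantities start at zero and are driven by the same TD error, `r_bar` in this sketch always equals `eta` times the sum of all entries of `q`. This means the estimate here is tied to the learned values and need not equal the true average reward of any policy.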

Figures (15)

Figure 1: Learning curves showing the difference in performance of Q-learning with and without reward centering for different discount factors on the Access-Control Queuing problem (Sutton & Barto, 1998). Plotted is the average per-step reward obtained by the agent across 50 runs w.r.t. the number of time steps of interaction. The shaded region denotes one standard error. See the section "Case Study: Q-learning with Reward Centering".
Figure 2: Comparison of the standard and the centered discounted values on a simple example.
Figure 3: Learning curves demonstrating the performance of TD-learning with and without reward centering on one on-policy problem and two off-policy problems.
Figure 4: Parameter studies showing the sensitivity of the algorithms’ performance to their parameters on the Access-Control problem. The error bars indicate one standard error, which at times is less than the width of the lines.
Figure 5: Learning curves on slight variants of the Access-Control Queuing problem with all the rewards shifted by a constant integer. The $y$-axis is shifted to compare learning curves for all the variants on the same scale. More details in-text.
...and 10 more figures

Theorems & Definitions (4)

Theorem 1
proof
Lemma 1
proof
