Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors

Juan Sebastian Rojas, Chi-Guhn Lee

Abstract

The temporal difference (TD) error was first formalized in Sutton (1988), where it was initially characterized as the difference between temporally successive predictions and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.

Paper Structure (19 sections, 4 theorems, 24 equations, 6 figures, 2 tables, 7 algorithms)

Key Result

Lemma 3.3

In tabular settings, both interpretations of the TD error are always equivalent. That is, $\delta^{e}_{t} = \delta^{i}_{t} \; \forall t \in \mathbb{N}$.
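
The lemma can be checked numerically with a minimal sketch. Definitions 3.1 and 3.2 are not reproduced in this summary, so the code below rests on assumptions: the explicit TD error is taken to be the bootstrapped target minus the current prediction, and the implicit TD error is taken to be the change in the prediction for the updated state, divided by the step size. The function names, the toy value table, and the one-weight `tanh` predictor are hypothetical and chosen only for illustration.

```python
import math

GAMMA = 0.9   # discount factor (illustrative value)
ALPHA = 0.1   # step size (illustrative value)


def tabular_td_errors(values, s, r, s_next):
    """Return (explicit, implicit) TD errors for one tabular TD(0) update."""
    # Explicit: bootstrapped target minus current prediction.
    explicit = r + GAMMA * values[s_next] - values[s]
    old_prediction = values[s]
    values[s] = old_prediction + ALPHA * explicit  # tabular TD(0) update
    # Implicit (assumed definition): change in the prediction over the step size.
    implicit = (values[s] - old_prediction) / ALPHA
    return explicit, implicit


def tanh_td_errors(w, x, r, x_next):
    """Same comparison for a one-weight nonlinear predictor v(x) = tanh(w * x)."""
    v = math.tanh(w * x)
    explicit = r + GAMMA * math.tanh(w * x_next) - v
    grad = (1.0 - v * v) * x             # derivative of tanh(w * x) w.r.t. w
    w_new = w + ALPHA * explicit * grad  # semi-gradient TD(0) step
    implicit = (math.tanh(w_new * x) - v) / ALPHA
    return explicit, implicit


values = [0.0, 0.5, 1.0]
tab_e, tab_i = tabular_td_errors(values, s=0, r=1.0, s_next=1)
nl_e, nl_i = tanh_td_errors(w=0.5, x=1.0, r=1.0, x_next=2.0)

print("tabular   |explicit - implicit| =", abs(tab_e - tab_i))
print("nonlinear |explicit - implicit| =", abs(nl_e - nl_i))
```

Under these assumptions the tabular gap is zero up to floating-point rounding, because the update moves the stored value by exactly the step size times the explicit error. With the `tanh` predictor, the change in the prediction passes through the gradient and the curvature of the function, so the two quantities no longer coincide.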

Figures (6)

  • Figure 1: Rolling average of the TD error absolute difference (i.e., $|\delta^e_t - \delta^i_t|$) as learning progresses when using a) a Q-learning algorithm in the inverted pendulum environment, and b) a DQN algorithm in the Atari Breakout environment. A solid line denotes the mean TD error absolute difference, and a shaded region denotes a 95% confidence interval over 4 runs.
  • Figure 2: Rolling average-reward estimates when using a deep differential Q-learning algorithm in the Breakout environment. Figures a), b), and c) utilize an explicit TD error average-reward update with an initial guess of -0.25, 0.0, and 0.25, respectively. Figures d), e), and f) utilize an implicit TD error update with an initial guess of -0.25, 0.0, and 0.25, respectively. A solid line denotes the mean average-reward estimate, and a shaded region denotes a 95% confidence interval over 4 runs.
  • Figure 3: Rolling total reward per episode when using: a) a differential Q-learning algorithm in the Breakout environment, b) a Q-learning algorithm with value-based reward centering in the Pong environment, and c) an A2C algorithm in the HalfCheetah environment. Each plot shows the performance of each algorithm when using the explicit vs. implicit TD error. A solid line denotes the mean total reward per episode, and a shaded region denotes a 95% confidence interval over 8 runs.
  • Figure B.1: An illustration of the a) inverted pendulum, and b) Atari Breakout environments.
  • Figure B.2: Rolling total reward per episode when using a deep differential Q-learning algorithm in the Breakout environment with an initial average-reward guess of: a) -0.25, b) 0.0, and c) 0.25. Each plot shows the performance when using (batch-averaged) explicit vs. implicit TD error average-reward updates. A solid line denotes the mean total reward per episode, and a shaded region denotes a 95% confidence interval over 4 runs.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Definition 3.1: Explicit TD Error
  • Definition 3.2: Implicit TD Error
  • Lemma 3.3
  • proof
  • Lemma 3.4: Similar to Exercise 9.6 of Sutton & Barto (2018)
  • Lemma 3.5
  • Proposition 4.1
  • proof