Gradient Iterated Temporal-Difference Learning

Théo Vincent, Kevin Gerhardt, Yogesh Tripathi, Habib Maraqten, Adam White, Martha White, Jan Peters, Carlo D'Eramo

TL;DR

Gradient Iterated Temporal-Difference learning modifies iterated TD learning by computing the gradients over its moving targets. The evaluation reveals that its learning speed is competitive with semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.

Abstract

Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.
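The difference between a semi-gradient update and an update that also differentiates the bootstrapped estimate can be illustrated with a small sketch. It assumes linear function approximation and a single hypothetical deterministic transition whose next-state feature is twice the current one; this toy setup, the function names, and the step size are illustrative choices, not taken from the paper. The second update is the plain residual-gradient form, shown only to make the missing gradient term visible. It is not the Gradient Iterated TD algorithm itself.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_error(w, phi, reward, phi_next, gamma):
    """delta = r + gamma * V(s') - V(s) for a linear value function V(s) = w . phi(s)."""
    return reward + gamma * dot(w, phi_next) - dot(w, phi)


def semi_gradient_step(w, phi, reward, phi_next, gamma, lr):
    """Semi-gradient TD(0): the bootstrapped target is treated as a constant."""
    delta = td_error(w, phi, reward, phi_next, gamma)
    return [wi + lr * delta * p for wi, p in zip(w, phi)]


def full_gradient_step(w, phi, reward, phi_next, gamma, lr):
    """Gradient descent on 0.5 * delta**2, keeping the gradient of the target."""
    delta = td_error(w, phi, reward, phi_next, gamma)
    return [wi + lr * delta * (p - gamma * pn)
            for wi, p, pn in zip(w, phi, phi_next)]


def run(step_fn, steps=100):
    """Repeat one update on the hypothetical transition phi=[1] -> phi'=[2], r=0."""
    w = [1.0]
    for _ in range(steps):
        w = step_fn(w, [1.0], 0.0, [2.0], 0.99, 0.1)
    return w[0]


if __name__ == "__main__":
    print("semi-gradient weight:", run(semi_gradient_step))
    print("full-gradient weight:", run(full_gradient_step))
```

On this toy transition, the semi-gradient weight grows at every step, while the full-gradient weight shrinks toward zero, which is the value that makes the TD error vanish here.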

Paper Structure (21 sections, 9 equations, 29 figures, 3 tables, 2 algorithms)

Figures (29)

  • Figure 1: Schematic representation of bootstrapping methods in the space of action-value functions, where the space of function approximation is colored in green. Left: TD learning uses a target network $\bar{Q}_0$ to construct a regression target estimating the Bellman iteration $\Gamma \bar{Q}_0$, where $\Gamma$ is the Bellman operator. This quantity is learned by the online network $Q_1$. Every $T$ gradient steps, the target network is set to the online network so that the following Bellman iterations can be learned. Right: At every step $k$, Gradient TD learning minimizes the distance between the function $Q_k$ and its regression target estimating the Bellman iteration $\Gamma Q_k$.
  • Figure 2: Left: In this figure, iterated TD learning learns $3$ projected Bellman iterations in parallel, where each Bellman iteration $\Gamma \bar{Q}_{k-1}$, built from the target network $\bar{Q}_{k-1}$, is learned with an online network $Q_k$. Right: Gradient iterated TD learning minimizes the sum of Bellman errors $\| \Gamma \bar{Q}_0 - Q_1 \|_2^2 + \| \Gamma Q_1 - Q_2 \|_2^2 + \| \Gamma Q_2 - Q_3 \|_2^2$. Each function $Q_k$ not only learns to regress its target $\Gamma Q_{k-1}$, but also to make the target $\Gamma Q_k$ for the following function $Q_{k+1}$ easier to regress.
  • Figure 3: Training procedure of Gradient Iterated TD learning with $K=2$. $Q_1$ not only learns the regression target built from $\bar{Q}_0$, but is also optimized so that the regression target built from itself is closer to $Q_2$. $Q_2$ learns the regression target built from $Q_1$. $H_2$ approximates the difference between the regression target built from $Q_1$, and $Q_2$. To save parameters, each $Q$ and $H$-network is represented by a head, built on a shared feature extractor, reducing the number of networks to $2$.
  • Figure 4: Evaluating the proposed approach in an off-policy setting with linear function approximation. We clarify that in the bottom plots, TD and i-TD learning overlap. Left: While i-TD and Gi-TD learning are both designed to decrease the sum of Bellman errors, only Gi-TD learning decreases this quantity when evaluated on Baird's counterexample. This leads to a low value error for Gi-TD learning, as opposed to i-TD learning, for which this error increases. Right: It is well-known that gradient TD methods have a slow learning speed when evaluated on the Hall problem (Baird, 1995). Notably, Gi-TD learning minimizes the value error faster than TDRC.
  • Figure 5: Evaluating the proposed approach in an on-policy setting with nonlinear function approximation, on the Triangle MP. We clarify that in the bottom plots, TD and i-TD learning overlap. Left: i-TD learning increases the sum of Bellman Errors (BEs) during training, although it is designed to minimize this quantity. This translates to high value errors. In contrast, Gi-TD learning decreases the sum of BEs during training, leading to a low final value error. Right: When changing the spiral direction, semi-gradient methods learn faster than gradient TD methods. Importantly, Gi-TD learning exhibits a faster learning speed than TDRC.
  • ...and 24 more figures
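
The objective described in the Figure 2 caption, a sum of Bellman errors in which each $Q_k$ also shapes the target of $Q_{k+1}$, can be sketched on a toy problem. The sketch assumes scalar linear functions $Q_k(s) = w_k \phi(s)$ and a single hypothetical deterministic transition; the factor $0.5$ in the loss, all function names, and the `through_targets` flag are illustrative assumptions. It omits the $H$-networks of Figure 3 and differentiates the squared errors directly, which is only reasonable here because the toy transition is deterministic. It should be read as an illustration of the two gradient treatments, not as the paper's implementation.

```python
def bellman_errors(weights, w_bar0, phi, reward, phi_next, gamma):
    """delta_k = r + gamma * Q_{k-1}(s') - Q_k(s) for k = 1..K, with Q_0 the fixed target."""
    previous = [w_bar0] + list(weights[:-1])
    return [reward + gamma * w_prev * phi_next - w_k * phi
            for w_prev, w_k in zip(previous, weights)]


def sum_of_bellman_errors(weights, w_bar0, phi, reward, phi_next, gamma):
    """0.5 * sum_k delta_k**2 (the 0.5 factor is a convenience, not from the caption)."""
    errors = bellman_errors(weights, w_bar0, phi, reward, phi_next, gamma)
    return sum(0.5 * d * d for d in errors)


def gradients(weights, w_bar0, phi, reward, phi_next, gamma, through_targets):
    """Gradient of the loss with respect to each w_k.

    through_targets=False: semi-gradient style, each target is a constant.
    through_targets=True: w_k also receives the gradient of the next target.
    """
    deltas = bellman_errors(weights, w_bar0, phi, reward, phi_next, gamma)
    grads = []
    for k, delta in enumerate(deltas):
        grad = -delta * phi  # term from Q_k regressing its own target
        if through_targets and k + 1 < len(deltas):
            # term from Q_k appearing inside the target of Q_{k+1}
            grad += deltas[k + 1] * gamma * phi_next
        grads.append(grad)
    return grads


if __name__ == "__main__":
    w = [1.0, 0.5, -0.3]                # hypothetical weights for Q_1, Q_2, Q_3
    setup = (0.2, 1.0, 1.0, 2.0, 0.9)   # w_bar0, phi, reward, phi_next, gamma
    print("semi:", gradients(w, *setup, through_targets=False))
    print("full:", gradients(w, *setup, through_targets=True))
```

With `through_targets=True`, the returned vector is the exact gradient of the summed loss, so every function except the last one receives an extra term pulling its induced target toward the next function. With `through_targets=False`, that term is dropped and each function only chases its own target.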