New Versions of Gradient Temporal Difference Learning

Donghwan Lee, Han-Dong Lim, Jihoon Park, Okyong Choi

TL;DR

The paper tackles the instability of off-policy TD learning with linear function approximation by introducing three GTD variants (GTD3, GTD4, GTD5) grounded in convex-concave saddle-point representations, thereby unifying GTD2 and related formulations under a single primal-dual gradient dynamics (PDGD) framework. It develops two saddle-point viewpoints (a dual representation and Fenchel duality) and provides a PDGD-based convergence analysis, together with an alternative ODE-based justification and a regularized Lagrangian approach. The key contributions are the new GTD3–GTD5 algorithms, a unified saddle-point analysis template for RL, and extensive simulations showing that GTD4 and GTD5 often converge faster than GTD2 and GTD3, especially when the regularization weight $\sigma$ diminishes over time. These results offer a more stable and efficient approach to off-policy policy evaluation with linear function approximation and point toward broader applicability of saddle-point methods in RL.

Abstract

Sutton, Szepesvári and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goal of this paper is (a) to propose some variants of GTDs with extensive comparative analysis and (b) to establish new theoretical analysis frameworks for the GTDs. These variants are based on convex-concave saddle-point interpretations of GTDs, which effectively unify all the GTDs into a single framework, and provide simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, numerical comparative analysis is given to evaluate these approaches.
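For orientation, the baseline that the new variants are compared against can be sketched in a few lines. The block below is a minimal sketch of one GTD2 update in its commonly stated form from Sutton et al., not of GTD3, GTD4, or GTD5, whose update rules are not reproduced in this summary. The function name `gtd2_step`, the argument names, and the two-step-size interface (`alpha`, `beta`) are assumptions made for illustration, and importance-sampling weights are omitted: the transition is assumed to have its next state generated under the target policy.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 update for a single transition (hypothetical helper).

    theta    : primary weights of the linear value estimate phi @ theta
    w        : auxiliary weights
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    """
    # TD error under the current primary weights.
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)
    # Primary update: move along (phi - gamma * phi_next), scaled by phi @ w.
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    # Auxiliary update: track the TD error projected onto the features.
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

# Example call on a toy transition with two tabular features.
theta0 = np.zeros(2)
w0 = np.array([1.0, 0.0])
theta1, w1 = gtd2_step(
    theta0, w0,
    phi=np.array([1.0, 0.0]),
    phi_next=np.array([0.0, 1.0]),
    reward=1.0, gamma=0.5, alpha=0.1, beta=0.2,
)
```

Both updates use the old values of `theta` and `w`, which is why the function returns new arrays instead of updating in place.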

Paper Structure

This paper contains 16 sections, 16 theorems, 41 equations, 3 figures, 4 algorithms.

Key Result

Lemma 1

Consider the nonlinear system $\frac{d}{dt}x_t = f(x_t)$, $t \geq 0$, and assume that $f$ is globally Lipschitz continuous, i.e., $\|f(x)-f(y)\|\le l \|x-y\|$ for all $x,y \in {\mathbb R}^n$, for some $l>0$ and norm $\|\cdot\|$. Then, for every initial state $x_0\in {\mathbb R}^n$, it admits a unique solution $x_t$ for all $t\geq 0$.
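To make the hypothesis concrete, here is a small numerical sketch, assuming the system in Lemma 1 has the form $\dot{x} = f(x)$. The vector field $f(x) = -x + \tanh(x)$ is a hypothetical example chosen because it is globally Lipschitz (its derivative lies in $[-1, 0]$, so $l = 1$ works). The helper `euler_solve` is also an assumption; forward Euler only approximates the solution whose existence and uniqueness the lemma guarantees.

```python
import numpy as np

def f(x):
    # Hypothetical globally Lipschitz vector field, applied elementwise.
    return -x + np.tanh(x)

def euler_solve(f, x0, t_end, dt=1e-3):
    """Approximate the solution of dx/dt = f(x) with forward Euler steps."""
    x = np.asarray(x0, dtype=float)
    n_steps = int(round(t_end / dt))
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x + dt * f(x)  # one explicit Euler step
        traj.append(x.copy())
    return np.array(traj)

# Trajectory from an arbitrary initial state over the interval [0, 1].
traj = euler_solve(f, [2.0, -1.0], t_end=1.0, dt=0.01)
```

The same routine can be started from any other `x0`, which mirrors the "for all $x_0$" part of the statement.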

Figures (3)

  • Figure 1: First instance: (a) Evolution of the error $\left\| \theta_k - \theta^* \right\|$ for step-size $\alpha _k = 5/(k + 5)$; (b) evolution of the error $\left\| \theta_k - \theta^* \right\|$ for step-size $\alpha _k = 10/(k + 10)$. The figure shows the error evolution for GTD2 (blue), GTD3 (red), GTD4 (green), and GTD5 (magenta) on a logarithmic scale. For GTD4 and GTD5, we used a diminishing $\sigma$: $\sigma _k = 100/(k + 100)$.
  • Figure 2: Second instance: (a) Evolution of the error $\left\| \theta_k - \theta^* \right\|$ for step-size $\alpha _k = 5/(k + 5)$; (b) evolution of the error $\left\| \theta_k - \theta^* \right\|$ for step-size $\alpha _k = 10/(k + 10)$. The figure shows the error evolution for GTD2 (blue), GTD3 (red), GTD4 (green), and GTD5 (magenta) on a logarithmic scale. For GTD4 and GTD5, we used a diminishing $\sigma$: $\sigma _k = 100/(k + 100)$.
  • Figure 3: Rankings of GTDs for 5000 MDP instances.
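The step-size and regularization schedules quoted in the captions all share the form $c/(k + c)$. A minimal sketch follows; the function names `step_size` and `sigma_weight` are assumptions for illustration and do not come from the paper.

```python
def step_size(k, c=5.0):
    """Diminishing step-size alpha_k = c / (k + c); c = 5 or 10 in the captions."""
    return c / (k + c)

def sigma_weight(k, c=100.0):
    """Diminishing regularization weight sigma_k = c / (k + c); c = 100 in the captions."""
    return c / (k + c)

# Values at a few iteration counts for the two step-size settings.
schedule = [(k, step_size(k, 5.0), step_size(k, 10.0), sigma_weight(k)) for k in (0, 10, 100)]
```

Each schedule starts at 1 for $k = 0$ and decays like $1/k$, so the sum of the step-sizes diverges while the sum of their squares stays finite.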

Theorems & Definitions (31)

  • Lemma 1: khalil2002nonlinear
  • Lemma 2: borkar2000ode
  • Definition 1: Saddle-point
  • Lemma 3
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Remark 1
  • ...and 21 more