New Versions of Gradient Temporal Difference Learning
Donghwan Lee, Han-Dong Lim, Jihoon Park, Okyong Choi
TL;DR
The paper tackles the instability of off-policy TD learning with linear function approximation by introducing three GTD variants (GTD3, GTD4, GTD5) grounded in convex-concave saddle-point representations, thereby unifying GTD2 and related formulations under a single primal-dual gradient dynamics (PDGD) framework. It develops multiple saddle-point viewpoints (dual representation and Fenchel duality) and provides a PDGD-based convergence analysis, including an alternative ODE-based justification and a regularized Lagrangian approach. The key contributions are the new GTD3–GTD5 algorithms, a unified saddle-point analysis template for RL, and extensive simulations showing that GTD4 and GTD5 often converge faster than GTD2 and GTD3, especially when the regularization weight $\sigma$ diminishes over time. These results offer a more stable and efficient approach to off-policy policy evaluation with linear function approximation and point toward broader applicability of saddle-point methods in RL.
Abstract
Sutton, Szepesvári and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goal of this paper is (a) to propose some variants of GTDs with extensive comparative analysis and (b) to establish new theoretical analysis frameworks for the GTDs. These variants are based on convex-concave saddle-point interpretations of GTDs, which effectively unify all the GTDs into a single framework, and provide simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, numerical comparative analysis is given to evaluate these approaches.
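For orientation, the baseline GTD2 update that the new variants are compared against can be sketched as follows. This is a minimal sketch of the standard GTD2 rule from the earlier literature, not of GTD3, GTD4, or GTD5, whose updates are not stated in this summary. The function name `gtd2_step`, its parameter names, and the toy transition are illustrative assumptions, and the placement of the importance sampling ratio `rho` in the second update varies across presentations.

```python
import numpy as np


def gtd2_step(theta, w, phi, phi_next, reward, gamma, rho, alpha, beta):
    """One GTD2 update for a single transition (hypothetical helper).

    theta: primal weights of the linear value estimate phi @ theta
    w:     auxiliary (dual) weights
    rho:   importance sampling ratio pi(a|s) / mu(a|s); use 1.0 on-policy
    """
    # TD error under the current primal weights.
    delta = reward + gamma * (phi_next @ theta) - phi @ theta
    # Both updates are computed from the old (theta, w) pair.
    theta_new = theta + alpha * rho * (phi - gamma * phi_next) * (phi @ w)
    w_new = w + beta * (rho * delta - phi @ w) * phi
    return theta_new, w_new


# Toy usage: replay one made-up transition twice with two features.
theta, w = np.zeros(2), np.zeros(2)
phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(2):
    theta, w = gtd2_step(theta, w, phi, phi_next, reward=1.0,
                         gamma=0.9, rho=1.0, alpha=0.1, beta=0.1)
print(theta, w)
```

In expectation, these two updates are a descent step in $\theta$ and an ascent step in $w$ on $L(\theta, w) = w^\top (b - A\theta) - \tfrac{1}{2} w^\top C w$, with $A = \mathbb{E}[\rho\,\phi(\phi - \gamma\phi')^\top]$, $b = \mathbb{E}[\rho\, r\, \phi]$ and $C = \mathbb{E}[\phi\phi^\top]$. This is one well-known saddle-point reading of GTD2; the paper's own formulations may differ in detail.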
