Finite-Time Error Bounds for Greedy-GQ

Yue Wang, Yi Zhou, Shaofeng Zou

TL;DR

This work derives finite-time error bounds for Greedy-GQ with linear function approximation in off-policy reinforcement learning. Its two-timescale analysis yields $\mathcal{O}(1/\sqrt{T})$ convergence under i.i.d. data and $\mathcal{O}(\log T/\sqrt{T})$ under Markovian data, despite non-convexity and gradient bias. It further proposes a nested-loop variant whose sample complexity is $\mathcal{O}(\log(1/\varepsilon)\varepsilon^{-2})$, i.e., near-optimal up to an $\mathcal{O}(\log(1/\varepsilon))$ factor; this matches the sample complexity of vanilla Greedy-GQ and aligns with SGD rates for general smooth non-convex problems. The results give practical guidance on step-size selection and expose the trade-off between convergence speed and policy quality, while offering a general framework for analyzing non-convex two-timescale RL algorithms. Overall, the paper advances the theoretical understanding of finite-sample behavior in off-policy, value-based RL with two timescales and nonlinear updates.
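
As a worked step, the link between the i.i.d. rate and an $\varepsilon^{-2}$ sample count can be sketched as follows. The constant $C$ is a placeholder introduced here for illustration, not a constant taken from the paper, and the error measure is assumed to be bounded by $C/\sqrt{T}$:

```latex
\frac{C}{\sqrt{T}} \le \varepsilon
\quad\Longleftrightarrow\quad
T \ge \frac{C^{2}}{\varepsilon^{2}},
\qquad\text{so } T = \mathcal{O}\!\left(\varepsilon^{-2}\right)
\text{ samples suffice under this assumed form of the bound.}
```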

Abstract

Greedy-GQ with linear function approximation, originally proposed in \cite{maei2010toward}, is a value-based off-policy algorithm for optimal control in reinforcement learning, and it has a non-linear two-timescale structure with a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as $\mathcal{O}({1}/{\sqrt{T}})$ under the i.i.d.\ setting and $\mathcal{O}({\log T}/{\sqrt{T}})$ under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is $\mathcal{O}({\log(1/\varepsilon)\varepsilon^{-2}})$, which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of the stochastic gradient descent algorithm for general smooth non-convex optimization problems, despite the additional challenge of the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step-sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.
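
To make the two-timescale structure concrete, here is a minimal sketch of a single Greedy-GQ update in the form commonly attributed to \cite{maei2010toward}, using a hard greedy (argmax) target policy. This is an illustrative assumption, not the paper's exact algorithm: the paper's Lemma 1 suggests its analysis works with a smoothed softmax policy instead, and the function name, argument names, and step sizes below are all hypothetical.

```python
import numpy as np

def greedy_gq_step(theta, omega, phi, reward, next_phis, alpha, beta, gamma):
    """One Greedy-GQ-style update with a hard greedy target (illustrative sketch).

    theta:     main parameter of the linear Q estimate, shape (N,)
    omega:     auxiliary weights, shape (N,)
    phi:       features of the sampled (s, a), shape (N,)
    next_phis: features of every action at s', shape (num_actions, N)
    """
    # Features of the greedy action at s' under the current Q estimate.
    phi_hat = next_phis[np.argmax(next_phis @ theta)]
    # TD error with the greedy bootstrap target.
    delta = reward + gamma * (phi_hat @ theta) - phi @ theta
    # Main parameter update, including the gradient-correction term.
    theta_new = theta + alpha * (delta * phi - gamma * (omega @ phi) * phi_hat)
    # Auxiliary update; this iterate is usually run with its own step size beta.
    omega_new = omega + beta * (delta - phi @ omega) * phi
    return theta_new, omega_new

# Tiny usage example with made-up numbers (two features, two actions).
theta, omega = np.zeros(2), np.zeros(2)
theta, omega = greedy_gq_step(
    theta, omega,
    phi=np.array([1.0, 0.0]), reward=1.0,
    next_phis=np.array([[0.0, 1.0], [1.0, 0.0]]),
    alpha=0.1, beta=0.5, gamma=0.9,
)
```

The two iterates are coupled: the `theta` update uses `omega` in its correction term, while the `omega` update depends on `theta` through the TD error, which is what makes the finite-time analysis a two-timescale problem.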

Paper Structure (26 sections, 17 theorems, 82 equations, 5 figures, 2 algorithms)

Key Result

Lemma 1

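The softmax policy $\pi_{\theta}(a\vert s)$ is $2\sigma$-Lipschitz and $8\sigma^2$-smooth, i.e., for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $\theta_1,\theta_2\in\mathbb R^N$,

$$|\pi_{\theta_1}(a\vert s)-\pi_{\theta_2}(a\vert s)| \le 2\sigma\|\theta_1-\theta_2\|, \qquad \|\nabla\pi_{\theta_1}(a\vert s)-\nabla\pi_{\theta_2}(a\vert s)\| \le 8\sigma^2\|\theta_1-\theta_2\|.$$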

Figures (5)

  • Figure 1: Garnet Problem 1.
  • Figure 2: Garnet Problem 2.
  • Figure 3: Frozen Lake Problem 1.
  • Figure 4: Frozen Lake Problem 2.
  • Figure 5

Theorems & Definitions (28)

  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Remark 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 18 more