Finite-Time Error Bounds for Greedy-GQ

Yue Wang; Yi Zhou; Shaofeng Zou

TL;DR

This work establishes finite-time error bounds for Greedy-GQ with linear function approximation, a value-based off-policy algorithm for optimal control in reinforcement learning. Its two-timescale analysis yields a convergence rate of $\mathcal{O}(1/\sqrt{T})$ under i.i.d. data and $\mathcal{O}(\log T/\sqrt{T})$ under Markovian data, despite the non-convex objective and biased gradient estimates. It further proposes a nested-loop variant with sample complexity $\mathcal{O}(\log(1/\varepsilon)\varepsilon^{-2})$, which matches that of vanilla Greedy-GQ; the error bounds also match those of SGD for general smooth non-convex problems. The results give guidance on step-size selection, expose the trade-off between the convergence rate and the quality of the obtained policy, and provide a general approach for analyzing non-convex two-timescale value-based RL algorithms.

Abstract

Greedy-GQ with linear function approximation, originally proposed by Maei et al. (2010), is a value-based off-policy algorithm for optimal control in reinforcement learning. It has a non-linear two-timescale structure with a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as $\mathcal{O}(1/\sqrt{T})$ under the i.i.d. setting and $\mathcal{O}(\log T/\sqrt{T})$ under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is $\mathcal{O}(\log(1/\varepsilon)\varepsilon^{-2})$, which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of stochastic gradient descent algorithms for general smooth non-convex optimization problems, despite the additional challenge of the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.
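To make the two-timescale structure concrete, here is a minimal sketch of one Greedy-GQ step with a softmax target policy. The update equations are not given on this page; the form below follows the commonly stated Greedy-GQ update (a main iterate `theta` and an auxiliary iterate `omega`, each with its own step size), so the exact form analyzed in the paper may differ. The function names, the temperature `sigma`, the discount `gamma`, and the step sizes `alpha` and `beta` are all illustrative assumptions.

```python
import numpy as np


def softmax_policy(theta, phi_s, sigma=1.0):
    """Softmax over Q(s, a) = theta^T phi(s, a); phi_s has shape (num_actions, dim)."""
    logits = sigma * (phi_s @ theta)
    logits = logits - logits.max()  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()


def v_bar_and_grad(theta, phi_s, sigma=1.0):
    """Return V(s) = sum_a pi(a|s) Q(s, a) and its gradient with respect to theta."""
    pi = softmax_policy(theta, phi_s, sigma)
    q = phi_s @ theta
    v = pi @ q
    phi_bar = pi @ phi_s  # policy-averaged feature vector
    # d/dtheta of sum_a pi_a q_a, using grad pi_a = sigma * pi_a * (phi_a - phi_bar)
    grad = phi_bar + sigma * ((pi * q) @ phi_s - v * phi_bar)
    return v, grad


def greedy_gq_step(theta, omega, phi_sa, reward, phi_next,
                   alpha, beta, gamma=0.95, sigma=1.0):
    """One assumed Greedy-GQ update from a transition (s, a, r, s')."""
    v_next, grad_v_next = v_bar_and_grad(theta, phi_next, sigma)
    delta = reward + gamma * v_next - phi_sa @ theta  # TD error
    # Main iterate: TD term plus a gradient-correction term that uses omega.
    theta_new = theta + alpha * (delta * phi_sa - gamma * (omega @ phi_sa) * grad_v_next)
    # Auxiliary iterate: tracks a least-squares fit of delta onto the features.
    omega_new = omega + beta * (delta - phi_sa @ omega) * phi_sa
    return theta_new, omega_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_actions = 4, 3
    theta, omega = np.zeros(dim), np.zeros(dim)
    for _ in range(5):
        phi_sa = rng.normal(size=dim)                   # hypothetical phi(s, a)
        phi_next = rng.normal(size=(num_actions, dim))  # hypothetical phi(s', .)
        theta, omega = greedy_gq_step(theta, omega, phi_sa, rng.normal(),
                                      phi_next, alpha=0.05, beta=0.1)
    print(theta, omega)
```

The coupling is visible in `greedy_gq_step`: the `theta` update depends on the current `omega`, and the `omega` update depends on a TD error computed from `theta`, which is why the two step sizes have to be chosen jointly.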

Paper Structure (26 sections, 17 theorems, 82 equations, 5 figures, 2 algorithms)

Introduction
Main Challenges and Contributions
Related Work
Preliminaries
Markov Decision Process
Linear Function Approximation
Greedy-GQ Algorithm
Finite-Time Error Bound for Greedy-GQ
Technical Assumptions
Finite-time Error Bound and Sample Complexity
Discussion on Technical Challenges
Discussion on Theoretical Results
Nested-loop Greedy-GQ
Algorithm
Finite-time Error Bound and Sample Complexity
...and 11 more sections

Key Result

Lemma 1

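The softmax policy $\pi_{\theta}(a\vert s)$ is $2\sigma$-Lipschitz and $8\sigma^2$-smooth, i.e., for any $(s,a)\in\mathcal{S}\times\mathcal{A}$, and for any $\theta_1,\theta_2\in\mathbb R^N$, $\vert\pi_{\theta_1}(a\vert s)-\pi_{\theta_2}(a\vert s)\vert\leq 2\sigma\Vert\theta_1-\theta_2\Vert$ and $\Vert\nabla\pi_{\theta_1}(a\vert s)-\nabla\pi_{\theta_2}(a\vert s)\Vert\leq 8\sigma^2\Vert\theta_1-\theta_2\Vert$.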
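The two constants in Lemma 1 can be checked numerically. The sketch below assumes the softmax policy is $\pi_\theta(a\vert s)\propto\exp(\sigma\,\theta^\top\phi_{s,a})$ with feature norms bounded by 1; neither detail is stated on this page, so both are assumptions. It reads "$2\sigma$-Lipschitz" as a bound on $\vert\pi_{\theta_1}(a\vert s)-\pi_{\theta_2}(a\vert s)\vert$ and "$8\sigma^2$-smooth" as a bound on the gradient difference, each relative to $\Vert\theta_1-\theta_2\Vert$. The helper names are hypothetical.

```python
import numpy as np


def softmax_policy(theta, phi_s, sigma):
    """pi_theta(.|s) for features phi_s of shape (num_actions, dim)."""
    logits = sigma * (phi_s @ theta)
    logits = logits - logits.max()  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()


def grad_policy(theta, phi_s, sigma, a):
    """Gradient of pi_theta(a|s) with respect to theta."""
    pi = softmax_policy(theta, phi_s, sigma)
    phi_bar = pi @ phi_s
    return sigma * pi[a] * (phi_s[a] - phi_bar)


def worst_ratios(num_trials=1000, num_actions=4, dim=5, sigma=2.0, seed=0):
    """Largest observed Lipschitz and smoothness ratios over random draws."""
    rng = np.random.default_rng(seed)
    worst_lip, worst_smooth = 0.0, 0.0
    for _ in range(num_trials):
        phi_s = rng.normal(size=(num_actions, dim))
        # Rescale rows so every feature vector has norm at most 1 (assumption).
        phi_s /= np.maximum(1.0, np.linalg.norm(phi_s, axis=1, keepdims=True))
        t1, t2 = rng.normal(size=dim), rng.normal(size=dim)
        gap = np.linalg.norm(t1 - t2)
        p1 = softmax_policy(t1, phi_s, sigma)
        p2 = softmax_policy(t2, phi_s, sigma)
        for a in range(num_actions):
            worst_lip = max(worst_lip, abs(p1[a] - p2[a]) / gap)
            g1 = grad_policy(t1, phi_s, sigma, a)
            g2 = grad_policy(t2, phi_s, sigma, a)
            worst_smooth = max(worst_smooth, np.linalg.norm(g1 - g2) / gap)
    return worst_lip, worst_smooth


if __name__ == "__main__":
    sigma = 2.0
    lip, smooth = worst_ratios(sigma=sigma)
    print(f"Lipschitz ratio {lip:.3f} vs bound {2 * sigma}")
    print(f"smoothness ratio {smooth:.3f} vs bound {8 * sigma ** 2}")
```

A random search like this can only fail to find a violation; it does not prove the bounds, and the observed ratios are typically well below the stated constants.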

Figures (5)

Figure 1: Garnet Problem 1.
Figure 2: Garnet Problem 2.
Figure 3: Frozen Lake Problem 1.
Figure 4: Frozen Lake Problem 2.
Figure 5: (no caption)

Theorems & Definitions (28)

Lemma 1
Theorem 1
Corollary 1
Theorem 2
Corollary 2
Remark 1
Lemma 2
proof
Lemma 3
proof
...and 18 more
