Finite-Time Error Bounds for Greedy-GQ
Yue Wang, Yi Zhou, Shaofeng Zou
TL;DR
This work establishes finite-time error bounds for Greedy-GQ with linear function approximation in off-policy reinforcement learning. It develops a tight two-timescale analysis that yields $\mathcal{O}(1/\sqrt{T})$ convergence under i.i.d. data and $\mathcal{O}(\log T/\sqrt{T})$ under Markovian data, despite non-convexity and gradient bias. It further proposes a nested-loop variant with sample complexity $\mathcal{O}(\log(1/\varepsilon)\varepsilon^{-2})$, which is near-optimal up to the $\log(1/\varepsilon)$ factor; this matches the sample complexity of vanilla Greedy-GQ and is consistent with SGD rates for general smooth non-convex problems. The results provide practical guidance on step-size selection and expose a trade-off between convergence speed and the quality of the learned policy, while offering a general framework for analyzing non-convex two-timescale RL algorithms. Overall, the paper advances the theoretical understanding of finite-sample behavior in off-policy, value-based RL with two timescales and nonlinear updates.
Abstract
Greedy-GQ with linear function approximation, originally proposed in \cite{maei2010toward}, is a value-based off-policy algorithm for optimal control in reinforcement learning. It has a non-linear two-timescale structure with a non-convex objective function. This paper develops its tightest finite-time error bounds. We show that the Greedy-GQ algorithm converges as fast as $\mathcal{O}({1}/{\sqrt{T}})$ under the i.i.d.\ setting and $\mathcal{O}({\log T}/{\sqrt{T}})$ under the Markovian setting. We further design a variant of the vanilla Greedy-GQ algorithm using the nested-loop approach, and show that its sample complexity is $\mathcal{O}({\log(1/\varepsilon)\varepsilon^{-2}})$, which matches that of the vanilla Greedy-GQ. Our finite-time error bounds match those of the stochastic gradient descent algorithm for general smooth non-convex optimization problems, despite the additional challenge of the two-timescale updates. Our finite-sample analysis provides theoretical guidance on choosing step-sizes for faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our techniques provide a general approach for finite-sample analysis of non-convex two-timescale value-based reinforcement learning algorithms.
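
To give a feel for how the two stated rates translate into iteration counts, the following is a minimal numerical sketch. It assumes every hidden constant in the bounds equals 1, which is purely illustrative: the actual bounds carry problem-dependent constants. The function names `iid_bound`, `markov_bound`, and `min_iterations` are hypothetical helpers introduced here and are not part of the paper.

```python
import math


def iid_bound(T):
    """Bound shape 1/sqrt(T) (i.i.d. setting); constant assumed to be 1."""
    return 1.0 / math.sqrt(T)


def markov_bound(T):
    """Bound shape log(T)/sqrt(T) (Markovian setting); constant assumed to be 1."""
    return math.log(T) / math.sqrt(T)


def min_iterations(bound, eps, lo=8):
    """Return the smallest integer T >= lo with bound(T) <= eps.

    Both shapes are decreasing for T >= 8 (log(T)/sqrt(T) peaks at
    T = e^2, about 7.39), so doubling followed by bisection is valid.
    """
    if bound(lo) <= eps:
        return lo
    hi = lo
    # Doubling phase: find hi with bound(hi) <= eps while bound(lo) > eps.
    while bound(hi) > eps:
        lo, hi = hi, hi * 2
    # Bisection phase: shrink (lo, hi] to the first T that meets the target.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bound(mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi


# Compare iteration counts for a few target accuracies.
for eps in (0.1, 0.05, 0.01):
    t_iid = min_iterations(iid_bound, eps)
    t_markov = min_iterations(markov_bound, eps)
    print(f"eps={eps}: i.i.d. T={t_iid}, Markovian T={t_markov}")
```

Under these unit-constant assumptions, the Markovian count exceeds the i.i.d. count only through the logarithmic factor, which is the gap between the two rates stated above.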
