Sharp asymptotic theory for Q-learning with LD2Z learning rate and its generalization

Soham Bonnerjee, Zhipeng Lou, Wei Biao Wu

Abstract

Despite the sustained popularity of Q-learning as a practical tool for policy determination, the majority of the relevant theoretical literature deals with either constant ($\eta_{t}\equiv \eta$) or polynomially decaying ($\eta_{t} = \eta t^{-\alpha}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear-decay-to-zero (\texttt{LD2Z}: $\eta_{t,n}=\eta(1-t/n)$) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay-to-zero schedules (\texttt{PD2Z}-$\nu$: $\eta_{t,n}=\eta(1-t/n)^{\nu}$). Proceeding step by step, we present a sharp non-asymptotic error bound for Q-learning with the \texttt{PD2Z}-$\nu$ schedule, which is then used to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, we provide a novel time-uniform Gaussian approximation (also known as a \textit{strong invariance principle}) for the partial-sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond making new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that \texttt{LD2Z}, and more generally \texttt{PD2Z}-$\nu$, achieve a best-of-both-worlds property: they inherit the rapid decay of the initialization error (characteristic of constant step-sizes) while retaining asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of \texttt{LD2Z}, and our results provide practical guidelines for inference.
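
To make the objects in the abstract concrete, here is a minimal sketch of tabular Q-learning under a \texttt{PD2Z}-$\nu$ schedule with tail Polyak-Ruppert averaging. The toy two-state MDP, $\eta=0.05$, the horizon $n$, and the tail fraction $\rho$ are illustrative assumptions, not the paper's experimental setup.

```python
# A minimal sketch of tabular Q-learning with a PD2Z-nu learning schedule and
# tail Polyak-Ruppert averaging. The two-state MDP, eta = 0.05, the horizon n,
# and the tail fraction rho are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, :]
R = rng.uniform(size=(S, A))                 # deterministic reward table

def pd2z(t, n, eta=0.05, nu=1.0):
    """PD2Z-nu schedule eta_{t,n} = eta (1 - t/n)^nu; nu = 1 recovers LD2Z."""
    return eta * (1.0 - t / n) ** nu

n, rho = 20_000, 0.5                          # horizon, tail fraction (assumed)
Q = np.zeros((S, A))
tail_sum, tail_count = np.zeros((S, A)), 0
s = 0
for t in range(n):
    a = int(rng.integers(A))                  # uniform behavior policy
    s_next = int(rng.choice(S, p=P[s, a]))
    td = R[s, a] + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += pd2z(t, n) * td                # Q-learning update with PD2Z-nu
    if t >= int(rho * n):                     # average only the tail iterates
        tail_sum += Q
        tail_count += 1
    s = s_next

Q_tail_pr = tail_sum / tail_count             # tail Polyak-Ruppert estimator
print(np.round(Q_tail_pr, 3))
```

Classical Polyak-Ruppert averaging would average all iterates; the tail variant drops the early transient before averaging (the burn-in fraction $\rho=0.5$ above is an arbitrary choice).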

Paper Structure

This paper contains 31 sections, 9 theorems, 84 equations, and 9 figures.

Key Result

Theorem 3.3

Consider the $Q$-learning iterates in (eq:Q-iterate). Suppose that, for some $p\geq 2$, the Bellman noise satisfies $\Theta_p:= \mathbb{E}[|Z_t|^p]<\infty$. Then, under the PD2Z-$\nu$ learning schedule with $\eta>0$ and $\nu\geq 1/p$ satisfying the theorem's conditions, a sharp non-asymptotic moment bound holds, where $c_3 = \frac{\eta c_1 - \eta^2 c_2}{2\eta} = \frac{c_1 - \eta c_2}{2}$ with $c_1=2(1-\gamma)$ and $c_2= (1-\gamma)^2 + 2(p-1)\gamma^2$, and $C_1(c, \nu, p)$, $C_2(c, \nu, p)$ are constants depending on $c$, $\nu$, and $p$.
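
For a concrete feel of these constants, the snippet below evaluates them at arbitrary illustrative values ($\gamma=0.9$, $p=2$, $\eta=0.05$), which the theorem does not prescribe. Note that $c_3$ simplifies to $(c_1-\eta c_2)/2$, so $c_3>0$ exactly when $\eta < c_1/c_2$.

```python
# Illustrative evaluation of the constants in Theorem 3.3; the values of
# gamma, p, and eta are arbitrary choices, not prescribed by the theorem.
gamma, p, eta = 0.9, 2, 0.05

c1 = 2 * (1 - gamma)                           # c1 = 2(1 - gamma)
c2 = (1 - gamma) ** 2 + 2 * (p - 1) * gamma ** 2
c3 = (eta * c1 - eta ** 2 * c2) / (2 * eta)    # simplifies to (c1 - eta*c2)/2
print(round(c1, 4), round(c2, 4), round(c3, 5))   # 0.2 1.63 0.05925
```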

Figures (9)

  • Figure 1: Comparison between polynomially decaying ($\eta_t=0.05t^{-0.65}$), LD2Z ($\eta_t=0.05(1-t/n)$), and constant ($\eta_t=0.05$) step-sizes (a plotting sketch follows this list).
  • Figure 2: Performance comparison between LD2Z, PD2Z-$\nu$ with $\nu=2,3$ and constant learning schedules.
  • Figure 3: Q–Q plots of sup-norm distributions.
  • Figure 4: $\mathcal{L}_{\infty}$ error comparison of PR-averaged and tail PR-averaged iterates.
  • Figure 5: Comparison between different step-size choices.
  • ...and 4 more figures
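
To visualize the comparison summarized in Figure 1, the following minimal matplotlib sketch plots the three schedules from its caption; the horizon $n = 10{,}000$ is an assumed value.

```python
# Plot the three step-size schedules from Figure 1's caption; the horizon
# n = 10_000 is an assumed value, not taken from the paper.
import numpy as np
import matplotlib.pyplot as plt

n = 10_000
t = np.arange(1, n + 1, dtype=float)
plt.plot(t, 0.05 * t ** -0.65, label=r"polynomial: $0.05\,t^{-0.65}$")
plt.plot(t, 0.05 * (1 - t / n), label=r"LD2Z: $0.05\,(1 - t/n)$")
plt.plot(t, np.full(n, 0.05), label="constant: $0.05$")
plt.xlabel("$t$")
plt.ylabel(r"$\eta_t$")
plt.legend()
plt.show()
```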

Theorems & Definitions (19)

  • Theorem 3.3
  • Remark 3.4: A sample-complexity version of Theorem 3.3
  • Corollary 3.5
  • Theorem 3.6: cf. Chen et al. (2020), Corollary 4.1.2; Li et al. (2023), Theorem E.1
  • Remark 3.7
  • Remark 3.8
  • Corollary 3.9
  • Theorem 3.10
  • Theorem 4.1
  • Remark 4.2
  • ...and 9 more