Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity
Mohammad Boveiri, Peyman Mohajerin Esfahani
TL;DR
The paper tackles estimating the optimal Q-function for finite, $\gamma$-discounted MDPs under a synchronous generative setting. It introduces Variance-Reduced Cascade Q-learning (VRCQ), which combines Cascade Q-learning (CQ) with direct variance reduction to achieve improved $\ell_\infty$-norm guarantees and minimax optimality. The authors provide both global minimax and instance-dependent analyses, showing that VRCQ attains near-optimal sample complexity with epoch-based recentering, and they demonstrate instance-optimality in the policy evaluation regime where $|\mathcal U|=1$. Numerical experiments on Garnet MDPs and a two-state example corroborate the theoretical findings, highlighting VRCQ’s practical efficiency and robustness to noise in high-dimensional horizon settings.
Abstract
We study the problem of estimating the optimal Q-function of $\gamma$-discounted Markov decision processes (MDPs) in the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides stronger guarantees in the $\ell_\infty$-norm than existing model-free stochastic approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.
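To make the synchronous setting concrete, the following is a minimal Python sketch of plain synchronous Q-learning with a generative model, i.e. the kind of stochastic-approximation baseline the abstract compares against. It is not VRCQ and does not implement Cascade Q-learning or variance reduction. The function names, the array layout `P[s, a, s']`, the known deterministic reward table `r`, and the rescaled linear step size are all assumptions made for illustration.

```python
import numpy as np


def sample_next_states(P, rng):
    """Draw one next state for every (state, action) pair from the generative model.

    P has shape (n_states, n_actions, n_states) with P[s, a] a probability vector.
    """
    n_states = P.shape[0]
    cdf = np.cumsum(P, axis=2)
    u = rng.random(P.shape[:2] + (1,))
    idx = (u > cdf).sum(axis=2)           # inverse-CDF sampling for all pairs at once
    return np.minimum(idx, n_states - 1)  # guard against round-off in the CDF


def synchronous_q_learning(P, r, gamma, n_iters, seed=0):
    """Plain synchronous Q-learning (illustrative baseline, not VRCQ)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros_like(r, dtype=float)
    for k in range(1, n_iters + 1):
        # Rescaled linear step size; this particular schedule is an assumption.
        step = 1.0 / (1.0 + (1.0 - gamma) * k)
        next_states = sample_next_states(P, rng)
        # Empirical Bellman optimality operator built from one sample per pair.
        target = r + gamma * Q[next_states].max(axis=2)
        Q = (1.0 - step) * Q + step * target
    return Q


def q_value_iteration(P, r, gamma, n_iters=2000):
    """Exact Q-value iteration with the true model, used only as a reference."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(n_iters):
        Q = r + gamma * (P @ Q.max(axis=1))
    return Q
```

Each iteration consumes one sample per state-action pair, so the total sample count is the number of iterations times the number of pairs. Setting the number of actions to one turns the `max` into a no-op, so the same loop becomes a policy evaluation update, which matches the singleton-action case mentioned in the abstract.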
