Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity

Mohammad Boveiri; Peyman Mohajerin Esfahani

TL;DR

The paper tackles estimating the optimal Q-function for finite, $\gamma$-discounted MDPs under a synchronous generative setting. It introduces Variance-Reduced Cascade Q-learning (VRCQ), which combines Cascade Q-learning (CQ) with direct variance reduction to achieve improved $\ell_\infty$-norm guarantees and minimax optimality. The authors provide both global minimax and instance-dependent analyses, showing that VRCQ attains near-optimal sample complexity with epoch-based recentering, and they demonstrate instance optimality in the policy evaluation regime where $|\mathcal U|=1$. Numerical experiments on Garnet MDPs and a two-state example corroborate the theoretical findings, highlighting VRCQ’s practical efficiency and robustness to noise in long-horizon settings (large $\frac{1}{1-\gamma}$).

Abstract

We study the problem of estimating the optimal Q-function of $\gamma$-discounted Markov decision processes (MDPs) under the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides superior guarantees in the $\ell_\infty$-norm compared with the existing model-free stochastic approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.
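To make the synchronous setting concrete, the sketch below runs plain synchronous Q-learning against a generative model: at every iteration one independent next-state sample is drawn for each state-action pair. This is the classical baseline, not VRCQ or Cascade Q-learning, whose update rules are not reproduced on this page. The function names, the nested-list layout of `P` and `r`, and the rescaled linear step size are assumptions made for illustration.

```python
import random


def sample_next_state(P, s, a, rng):
    """Generative model: draw one next state from the distribution P[s][a]."""
    return rng.choices(range(len(P[s][a])), weights=P[s][a], k=1)[0]


def synchronous_q_learning(P, r, gamma, num_iters, seed=0):
    """Plain synchronous Q-learning (a baseline sketch, NOT VRCQ).

    P[s][a] is a list of next-state probabilities and r[s][a] a reward;
    both layouts are hypothetical choices for this example.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for k in range(1, num_iters + 1):
        # Rescaled linear step size (an assumption, not taken from the paper).
        step = 1.0 / (1.0 + (1.0 - gamma) * k)
        # Greedy state values of the current iterate, frozen for this sweep.
        V = [max(row) for row in Q]
        new_Q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                # One fresh, independent sample per state-action pair.
                s_next = sample_next_state(P, s, a, rng)
                target = r[s][a] + gamma * V[s_next]
                new_Q[s][a] = (1.0 - step) * Q[s][a] + step * target
        Q = new_Q
    return Q
```

With a singleton action set (`len(P[0]) == 1`), the maximum over actions is trivial and the same loop performs policy evaluation, which is the regime the abstract singles out.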

Paper Structure (24 sections, 10 theorems, 92 equations, 3 figures, 2 tables, 2 algorithms)

Introduction
Setting and Problem Description
Main Results
Cascade Q-Learning: A New Scheme to Reduce the Effect of Noise
Variance-Reduced Cascade Q-learning (VRCQ)
Global Minimax Analysis of VRCQ
Instance-Dependent Analysis of VRCQ When $|\mathcal{U}|=1$
Numerical Results
Conclusion and Future Directions
Proofs of the Main Results
Preliminaries
Proof of Proposition 1
Proof of Theorem 1
Proof of Proposition 2
Proof of Theorem 3
...and 9 more sections

Key Result

Proposition 1

Consider an MDP with discount factor $\gamma$ and optimal Q-function $\Theta^\star$. Suppose we run Algorithm 1 from the initialization $\Theta_0$ for $N_e$ iterations with the constant step size $\lambda =\frac{1}{\sqrt{N_e}}$. Then, we have

Figures (3)

Figure 1: Transition diagram of a class of MDPs, adopted from Khamaru.TD. The scalars $\beta \geq 0$ and $0 < p < 1$ are parameters of the construction. The chain remains in state $1$ with probability $p$ and transitions to state $2$ with probability $1-p$. State $2$ is absorbing.
Figure 2: Log-log plots of the $\ell_\infty$-error versus complexity parameter $\frac{1}{1-\gamma}$ for different algorithms. Each data point is an average of $500$ independent trials.
Figure 3: Comparison of the convergence behavior of VRCQ and variance-reduced Q-learning. For a given algorithm and value of $\gamma$, we run the algorithm for a certain number of epochs, thereby obtaining a path of $\ell_\infty$-errors at each iteration. We averaged these paths over a total of $500$ independent trials. The radius of the shaded area at each iteration represents the standard deviation of the $\ell_\infty$-error.
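The two-state chain in the Figure 1 caption can be written down directly, which helps in reading the policy evaluation experiments. The sketch below builds its transition matrix and computes the exact value function by fixed-point iteration on $V = r + \gamma P V$. The reward vector is a hypothetical placeholder: the caption does not state the rewards or how $\beta$ enters the construction, so $\beta$ is not modeled here. The function names are likewise illustrative.

```python
def two_state_chain(p):
    """Transition matrix from the Figure 1 caption.

    State 1 (index 0) stays put with probability p and moves to
    state 2 (index 1) with probability 1 - p; state 2 is absorbing.
    """
    return [[p, 1.0 - p], [0.0, 1.0]]


def evaluate(P, r, gamma, tol=1e-12, max_iters=1_000_000):
    """Solve V = r + gamma * P V by fixed-point iteration.

    The map is a gamma-contraction in the sup norm, so the iterates
    converge to the unique value function for any gamma in [0, 1).
    """
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iters):
        new_V = [
            r[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V
    return V


# Hypothetical rewards, chosen only to make the example concrete.
P = two_state_chain(p=0.5)
V = evaluate(P, r=[1.0, 0.0], gamma=0.9)
```

For this chain the fixed point also has a closed form, $V(2) = \frac{r_2}{1-\gamma}$ and $V(1) = \frac{r_1 + \gamma(1-p)V(2)}{1-\gamma p}$, which gives a reference value for checking the $\ell_\infty$-error of an estimate.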

Theorems & Definitions (15)

Proposition 1: Non-asymptotic guarantee for Cascade Q-learning
Theorem 1: Geometric convergence over epochs
Remark 1: Behavior of the parameters over epochs
Remark 2: Geometric convergence with shorter epoch lengths
Proposition 2: Minimax optimality of VRCQ
Remark 3: VRCQ versus other algorithms: Minimax viewpoint
Theorem 2 (Khamaru.TD): Lower bound on $\mathcal{M}_{N}(\mathcal{P})$
Theorem 3: Non-asymptotic optimality of VRCQ
Remark 4: Instance-dependent upper and lower bounds
Remark 5: VRCQ versus other algorithms: Instance-dependent behavior
...and 5 more
