Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity

Mohammad Boveiri, Peyman Mohajerin Esfahani

TL;DR

The paper studies the estimation of the optimal Q-function of finite, $\gamma$-discounted MDPs in the synchronous setting with a generative model. It introduces Variance-Reduced Cascade Q-learning (VRCQ), which combines Cascade Q-learning (CQ) with direct variance reduction to obtain improved $\ell_\infty$-norm guarantees and minimax optimality. The authors provide both global minimax and instance-dependent analyses, showing that VRCQ with epoch-based recentering attains near-optimal sample complexity, and they establish instance optimality in the policy evaluation regime where $|\mathcal U|=1$. Numerical experiments on Garnet MDPs and a two-state example corroborate the theoretical findings, highlighting VRCQ's practical efficiency and robustness to noise in long-horizon settings (large $\frac{1}{1-\gamma}$).

Abstract

We study the problem of estimating the optimal Q-function of $\gamma$-discounted Markov decision processes (MDPs) under the synchronous setting, where independent samples for all state-action pairs are drawn from a generative model at each iteration. We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ comprises two key building blocks: (i) the established direct variance reduction technique and (ii) our proposed variance reduction scheme, Cascade Q-learning. By leveraging these techniques, VRCQ provides superior guarantees in the $\ell_\infty$-norm compared with the existing model-free stochastic approximation-type algorithms. Specifically, we demonstrate that VRCQ is minimax optimal. Additionally, when the action set is a singleton (so that the Q-learning problem reduces to policy evaluation), it achieves non-asymptotic instance optimality while requiring the minimum number of samples theoretically possible. Our theoretical results and their practical implications are supported by numerical experiments.
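
To make the synchronous setting concrete, the following is a minimal sketch of plain synchronous Q-learning under a generative model. It is not VRCQ or Cascade Q-learning, only the baseline stochastic approximation scheme that such methods refine. The function names `sample_bellman` and `synchronous_q_learning`, the array layout, and the rescaled linear step size are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def sample_bellman(Q, P, r, gamma, rng):
    """One synchronous draw of the empirical Bellman operator.

    P has shape (S, A, S) and r has shape (S, A). For every pair (s, a),
    one next state is drawn from P[s, a], as a generative model would do.
    """
    S, A, _ = P.shape
    # Inverse-CDF sampling: one next state per state-action pair.
    cdf = np.cumsum(P, axis=-1)
    u = rng.random((S, A, 1))
    next_states = np.minimum((u >= cdf).sum(axis=-1), S - 1)
    # Q[next_states] has shape (S, A, A); maximize over the next action.
    return r + gamma * Q[next_states].max(axis=-1)

def synchronous_q_learning(P, r, gamma, n_iters, seed=0):
    """Plain synchronous Q-learning (baseline, no variance reduction)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros_like(r, dtype=float)
    for k in range(1, n_iters + 1):
        # Rescaled linear step size: an assumption, not the paper's choice.
        step = 1.0 / (1.0 + (1.0 - gamma) * k)
        Q = (1.0 - step) * Q + step * sample_bellman(Q, P, r, gamma, rng)
    return Q
```

Each iteration consumes one fresh sample per state-action pair, so the total sample count after `n_iters` iterations is `n_iters` times the number of state-action pairs.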

Paper Structure (24 sections, 10 theorems, 92 equations, 3 figures, 2 tables, 2 algorithms)

Key Result

Proposition 1

Consider an MDP with discount factor $\gamma$ and optimal Q-function $\Theta^\star$. Suppose we run Algorithm 1 from the initialization $\Theta_0$ for $N_e$ iterations with the constant step size $\lambda = \frac{1}{\sqrt{N_e}}$. Then, we have

Figures (3)

  • Figure 1: Transition diagram of a class of MDPs, adopted from Khamaru.TD. The scalars $\beta \geq 0$ and $0 < p < 1$ are parameters of the construction. The chain remains in state $1$ with probability $p$ and transitions to state $2$ with probability $1-p$. State $2$ is absorbing.
  • Figure 2: Log-log plots of the $\ell_\infty$-error versus complexity parameter $\frac{1}{1-\gamma}$ for different algorithms. Each data point is an average of $500$ independent trials.
  • Figure 3: Comparison of the convergence behavior of VRCQ and variance-reduced Q-learning. For a given algorithm and value of $\gamma$, we run the algorithm for a certain number of epochs, thereby obtaining a path of $\ell_\infty$-errors over the iterations. We average these paths over a total of $500$ independent trials. The radius of the shaded area at each iteration represents the standard deviation of the $\ell_\infty$-error.
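
The transition structure described in the Figure 1 caption can be sketched as follows. The transition matrix follows the caption directly. The reward vector is hypothetical, since the caption does not state the rewards or how $\beta$ enters them, and the helper names `two_state_chain` and `value_function` are illustrative only.

```python
import numpy as np

def two_state_chain(p):
    """Transition matrix: state 1 stays with probability p, state 2 is absorbing."""
    return np.array([[p, 1.0 - p],
                     [0.0, 1.0]])

def value_function(P, r, gamma):
    """Solve the Bellman equation V = r + gamma * P V exactly."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)

P = two_state_chain(p=0.9)
r = np.array([1.0, 0.0])  # hypothetical rewards, not taken from the paper
V = value_function(P, r, gamma=0.5)
```

With a single action per state this is a policy evaluation problem, matching the regime $|\mathcal U|=1$. When the reward in state $2$ is zero, the value of state $1$ is $r_1/(1-\gamma p)$, which gives a quick sanity check on the solver.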

Theorems & Definitions (15)

  • Proposition 1: Non-asymptotic guarantee for Cascade Q-learning
  • Theorem 1: Geometric convergence over epochs
  • Remark 1: Behavior of the parameters over epochs
  • Remark 2: Geometric convergence with shorter epoch lengths
  • Proposition 2: Minimax optimality of VRCQ
  • Remark 3: VRCQ versus other algorithms: Minimax viewpoint
  • Theorem 2 (Khamaru.TD): Lower bound on $\mathcal{M}_{N}(\mathcal{P})$
  • Theorem 3: Non-asymptotic optimality of VRCQ
  • Remark 4: Instance-dependent upper and lower bounds
  • Remark 5: VRCQ versus other algorithms: Instance-dependent behavior
  • ...and 5 more