Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

Zhong Zheng, Haochen Zhang, Lingzhou Xue

TL;DR

This paper studies gap-dependent regret for model-free, on-policy Q-learning in finite-horizon episodic MDPs, focusing on two algorithms, UCB-Advantage and Q-EarlySettled-Advantage, that use variance-based bonuses and reference-advantage decomposition. It introduces surrogate reference functions to recover martingale properties and to bound the error terms arising from the reference and advantage estimates. The resulting regret bounds are logarithmic in the total number of steps $T$ and depend on the minimum suboptimality gap $\Delta_{\min}$ and the maximum conditional variance $\mathbb{Q}^\star$. The main contributions are the first gap-dependent regret bounds for Q-learning with variance estimators and reference-advantage decomposition, plus a gap-dependent analysis of the policy switching cost of UCB-Advantage that improves upon prior worst-case results in non-degenerate MDPs. Together, these results show how benign MDP structure, namely a positive gap and low variance, can be exploited to obtain regret well below the worst-case rate.

Abstract

We study the gap-dependent bounds of two important algorithms for on-policy Q-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the almost optimal $\sqrt{T}$-type regret bound in the worst-case scenario, where $T$ is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for Q-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for Q-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in $T$ and improve upon existing ones for Q-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for Q-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for Q-learning.
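To make the two problem-dependent quantities concrete, here is a minimal sketch in plain Python that computes them for a small tabular MDP by backward induction. It assumes the conventional definitions from the gap-dependent literature, which this summary does not spell out: the gap $\Delta_h(s,a) = V_h^\star(s) - Q_h^\star(s,a)$, $\Delta_{\min}$ as the smallest strictly positive gap, and $\mathbb{Q}^\star$ as the largest variance of $V_{h+1}^\star$ under any transition distribution. The function names and the random MDP are hypothetical illustrations, not the paper's experimental setup.

```python
import random


def optimal_values(P, R, H, S, A):
    """Backward induction for a finite-horizon tabular MDP.

    P[h][s][a] is a list of next-state probabilities (length S),
    R[h][s][a] is a deterministic reward in [0, 1].
    Returns (Q, V) with V[H] identically zero.
    """
    V = [[0.0] * S for _ in range(H + 1)]
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                expected_next = sum(p * v for p, v in zip(P[h][s][a], V[h + 1]))
                Q[h][s][a] = R[h][s][a] + expected_next
            V[h][s] = max(Q[h][s])
    return Q, V


def min_gap(Q, V, tol=1e-12):
    """Smallest strictly positive gap V*_h(s) - Q*_h(s, a), or None if all gaps are zero."""
    gaps = [
        V[h][s] - q
        for h in range(len(Q))
        for s in range(len(Q[h]))
        for q in Q[h][s]
        if V[h][s] - q > tol
    ]
    return min(gaps) if gaps else None


def max_conditional_variance(P, V):
    """Largest variance of V*_{h+1}(s') with s' drawn from P[h][s][a] (assumed definition)."""
    best = 0.0
    for h in range(len(P)):
        for s in range(len(P[h])):
            for probs in P[h][s]:
                mean = sum(p * v for p, v in zip(probs, V[h + 1]))
                var = sum(p * (v - mean) ** 2 for p, v in zip(probs, V[h + 1]))
                best = max(best, var)
    return best


def random_mdp(H, S, A, seed=0):
    """Hypothetical random MDP generator, used only for illustration."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(H):
        P_h, R_h = [], []
        for _ in range(S):
            P_s, R_s = [], []
            for _ in range(A):
                weights = [rng.random() for _ in range(S)]
                total = sum(weights)
                P_s.append([w / total for w in weights])
                R_s.append(rng.random())
            P_h.append(P_s)
            R_h.append(R_s)
        P.append(P_h)
        R.append(R_h)
    return P, R


# Problem size borrowed from Figure 1 (H = 5, S = 3, A = 2); the MDP itself is random.
H, S, A = 5, 3, 2
P, R = random_mdp(H, S, A)
Q, V = optimal_values(P, R, H, S, A)
print("Delta_min:", min_gap(Q, V))
print("max conditional variance:", max_conditional_variance(P, V))
```

Under these assumed definitions, a larger `min_gap` and a smaller `max_conditional_variance` correspond to the benign regime in which the gap-dependent bounds are most favorable.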

Paper Structure

This paper contains 35 sections, 16 theorems, 246 equations, 4 figures, 3 algorithms.

Key Result

Theorem 3.1

For UCB-Advantage (Zhang et al. 2020) with $\beta \in (0, H]$, $\mathbb{E}[\textnormal{Regret}(T)]$ is upper bounded by eq_our_regret_intro_zihan.

Figures (4)

  • Figure 1: Numerical comparison of regrets with $H = 5$, $S = 3$, and $A = 2$
  • Figure 2: Numerical comparison of regrets with $H = 10$, $S = 5$, and $A = 5$
  • Figure 3: Policy switching cost of UCB-Advantage algorithm with $H = 5$, $S = 3$, and $A = 2$
  • Figure 4: Policy switching cost of UCB-Advantage algorithm with $H = 10$, $S = 5$, and $A = 5$

Theorems & Definitions (31)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Lemma C.1
  • Lemma C.2
  • Lemma C.3
  • ...and 21 more