Provably Efficient and Agile Randomized Q-Learning

He Wang, Xingyu Xu, Yuejie Chi

TL;DR

A novel Q-learning variant, referred to as RandomizedQ, is proposed for episodic tabular RL. It integrates sampling-based exploration with agile, step-wise policy updates, and it outperforms existing Q-learning variants with both bonus-based and Bayesian-based exploration on standard benchmarks.

Abstract

While Bayesian-based exploration often demonstrates superior empirical performance compared to bonus-based methods in model-based reinforcement learning (RL), its theoretical understanding remains limited for model-free settings. Existing provable algorithms either suffer from computational intractability or rely on stage-wise policy updates, which reduce responsiveness and slow down the learning process. In this paper, we propose a novel variant of the Q-learning algorithm, referred to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise policy updates for episodic tabular RL. We establish an $\widetilde{O}(\sqrt{H^5SAT})$ regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $T$ is the total number of episodes. In addition, we present a logarithmic regret bound under a mild positive sub-optimality condition on the optimal Q-function. Empirically, RandomizedQ exhibits outstanding performance compared to existing Q-learning variants with both bonus-based and Bayesian-based exploration on standard benchmarks.
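To make "sampling-based exploration with step-wise updates" concrete, the following is a minimal sketch of a single tabular Q-learning update in which the learning rate is drawn at random from a Beta distribution and the value estimate is refreshed immediately after every step. It is not the paper's RandomizedQ: the function name `randomized_q_step`, the choice of Beta(H + 1, n - 1) (whose mean (H + 1)/(H + n) matches a common Q-learning step-size schedule), and the zero initialization are all assumptions made for illustration, and the ensemble and optimism mechanisms of the actual algorithm are omitted.

```python
import random

def randomized_q_step(Q, V, counts, h, s, a, r, s_next, H, rng):
    """One illustrative step-wise update with a randomly sampled learning rate.

    Hypothetical helper, not the authors' algorithm. Q[h][s][a] and V[h][s]
    are nested lists; V has H + 1 rows so that V[H] is the terminal value 0.
    """
    counts[h][s][a] += 1
    n = counts[h][s][a]
    # Assumption: sample the learning rate from Beta(H + 1, n - 1), which has
    # mean (H + 1) / (H + n). The first visit uses rate 1 (overwrite the init).
    w = 1.0 if n == 1 else rng.betavariate(H + 1, n - 1)
    target = r + V[h + 1][s_next]
    Q[h][s][a] = (1.0 - w) * Q[h][s][a] + w * target
    # Step-wise refresh: the greedy value (and hence the policy) can change
    # right after this transition, not only at the end of a stage.
    V[h][s] = max(Q[h][s])
    return w

# Tiny usage example with H = 3 steps, S = 2 states, A = 2 actions.
H, S, A = 3, 2, 2
rng = random.Random(0)
Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
V = [[0.0] * S for _ in range(H + 1)]
counts = [[[0] * A for _ in range(S)] for _ in range(H)]

w1 = randomized_q_step(Q, V, counts, h=0, s=0, a=1, r=0.5, s_next=1, H=H, rng=rng)
w2 = randomized_q_step(Q, V, counts, h=0, s=0, a=1, r=1.0, s_next=1, H=H, rng=rng)
```

Because the learning rate is random, repeated visits to the same state-action pair yield randomized Q-estimates, which is one way sampling can stand in for an explicit exploration bonus.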

Paper Structure

This paper contains 54 sections, 22 theorems, 154 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 1

Consider $\delta\in (0,1)$. Assume that $J = \lceil{c\cdot\log(SAHT/\delta)}\rceil$, $\kappa^{\flat} = c\cdot(\log(SAH/\delta) + \log(T))$, and $n_0^{\flat} = \lceil c\cdot\log(T)\cdot \kappa \rceil$, where $c$ is some universal constant. Let the initialized value function $V_h^0 = 2(H-h+1)$ for any

Figures (1)

  • Figure 1: Comparison between RandomizedQ and baseline algorithms in the grid-world environment (first row) and the chain MDP (second row), where total regret is plotted against the number of episodes. RandomizedQ consistently achieves lower regret than UCB-Q, the standard randomized Q-learning (RandQL), and its stage-wise variant (Staged-RandQL), demonstrating superior sample efficiency and faster learning.

Theorems & Definitions (29)

  • Theorem 1
  • Remark 1: Anytime convergence guarantee
  • Theorem 2
  • Definition 1: Beta distribution
  • Lemma 1: Moments of the Beta distribution
  • Lemma 2
  • proof
  • Lemma 3: Rosenthal inequality, Theorem 4.1 in Pinelis (1994)
  • Lemma 4: Corrected version of Lemma 12 in Tiapkin et al. (2024)
  • proof
  • ...and 19 more