Finite-Time Analysis of Simultaneous Double Q-learning

Hyunjun Na, Donghwan Lee

TL;DR

The paper tackles maximization bias in Q-learning by introducing simultaneous double Q-learning (SDQ), where two estimators are updated in parallel with cross-referenced greedy actions, removing the need for random estimator switching. It reframes SDQ as a discrete-time switching system and develops an auxiliary framework of upper and lower comparison systems plus an error system to derive finite-time, expected error bounds. Empirical results on Grid World and OpenAI Gym tasks show SDQ converges faster than standard double Q-learning while mitigating overestimation bias. Theoretical contributions include a finite-time bound of the form $\mathbb{E}[\lVert Q_k^A-Q^*\rVert_\infty] \le \frac{120\alpha^{1/2}|\,\mathcal{S}\times\mathcal{A}\,|}{d_{min}^{9/2}(1-\gamma)^{11/2}}+\frac{48\rho^{k-4}k^{4}|\,\mathcal{S}\times\mathcal{A}\,|^{3/2}}{(1-\gamma)}$, with $\rho=1-\alpha d_{min}(1-\gamma)$, illuminating finite-time behavior under i.i.d. sampling with stochastic coverage. Overall, the work provides a control-theoretic lens on double Q-learning, yielding practical convergence benefits and new analytical tools for future extensions to function approximation and adaptive settings.
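
To make the shape of the quoted bound concrete, the following minimal sketch evaluates its two terms separately. The function name `sdq_error_bound`, the argument names, and the input checks are illustrative assumptions, not part of the paper; only the formula itself is taken from the text above. The first term does not depend on $k$ and scales with $\alpha^{1/2}$, while the second term decays geometrically through $\rho^{k-4}$.

```python
import math


def sdq_error_bound(k, alpha, d_min, gamma, n_sa):
    """Return (constant_term, transient_term) of the bound quoted in the TL;DR.

    k      : iteration index
    alpha  : constant step-size
    d_min  : smallest state-action sampling probability
    gamma  : discount factor
    n_sa   : |S x A|, the number of state-action pairs
    The range checks below are assumptions made for this sketch.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < d_min <= 1.0 and 0.0 <= gamma < 1.0):
        raise ValueError("expected 0 < alpha < 1, 0 < d_min <= 1, 0 <= gamma < 1")

    rho = 1.0 - alpha * d_min * (1.0 - gamma)

    # Term that is independent of k: 120 * alpha^{1/2} * |SxA| / (d_min^{9/2} (1-gamma)^{11/2})
    constant_term = 120.0 * math.sqrt(alpha) * n_sa / (d_min ** 4.5 * (1.0 - gamma) ** 5.5)

    # Term that vanishes as k grows: 48 * rho^{k-4} * k^4 * |SxA|^{3/2} / (1-gamma)
    transient_term = 48.0 * rho ** (k - 4) * k ** 4 * n_sa ** 1.5 / (1.0 - gamma)

    return constant_term, transient_term


if __name__ == "__main__":
    for k in (10**4, 10**5, 10**6):
        const, trans = sdq_error_bound(k, alpha=0.01, d_min=0.1, gamma=0.9, n_sa=4)
        print(f"k={k:>8d}  constant={const:.3e}  transient={trans:.3e}")
```

Because of the large constants and the high powers of $d_{min}$ and $1-\gamma$ in the denominator, the numerical value can be very loose for typical parameters; the sketch is useful mainly for seeing how each term reacts to $\alpha$ and $k$.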

Abstract

$Q$-learning is one of the most fundamental reinforcement learning (RL) algorithms. Despite its widespread success in various applications, it is prone to overestimation bias in the $Q$-learning update. To address this issue, double $Q$-learning employs two independent $Q$-estimators which are randomly selected and updated during the learning process. This paper proposes a modified double $Q$-learning, called simultaneous double $Q$-learning (SDQ), with its finite-time analysis. SDQ eliminates the need for random selection between the two $Q$-estimators, and this modification allows us to analyze double $Q$-learning through the lens of a novel switching system framework facilitating efficient finite-time analysis. Empirical studies demonstrate that SDQ converges faster than double $Q$-learning while retaining the ability to mitigate the maximization bias. Finally, we derive a finite-time expected error bound for SDQ.
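
The contrast between the two update schemes can be sketched in a few lines of tabular Python. This is a hedged illustration, not the paper's algorithm listing: the helper names (`greedy`, `double_q_step`, `sdq_step`), the list-of-lists table layout, and the `terminal` flag are assumptions. In particular, the pairing used in `sdq_step` (each table evaluates, with its own values, the action that is greedy under the other table) is one reading of "cross-referenced greedy actions" and should be checked against the paper's update equations.

```python
import random


def greedy(q_row):
    """Index of the largest entry in a row of action values (lowest index wins ties)."""
    return max(range(len(q_row)), key=q_row.__getitem__)


def double_q_step(qa, qb, s, a, r, s_next, alpha, gamma, rng, terminal=False):
    """Standard double Q-learning: a coin flip decides which single table is updated."""
    if rng.random() < 0.5:
        updated, other = qa, qb
    else:
        updated, other = qb, qa
    # The updated table picks the greedy action, the other table evaluates it.
    bootstrap = 0.0 if terminal else other[s_next][greedy(updated[s_next])]
    updated[s][a] += alpha * (r + gamma * bootstrap - updated[s][a])


def sdq_step(qa, qb, s, a, r, s_next, alpha, gamma, terminal=False):
    """Simultaneous variant: both tables are updated from the same sample, no coin flip.

    ASSUMPTION: the greedy action comes from the other table and is evaluated
    with the table being updated; the paper's exact pairing may differ.
    """
    if terminal:
        boot_a = boot_b = 0.0
    else:
        boot_a = qa[s_next][greedy(qb[s_next])]
        boot_b = qb[s_next][greedy(qa[s_next])]
    # Compute both new values from the old tables before writing either one.
    new_a = qa[s][a] + alpha * (r + gamma * boot_a - qa[s][a])
    new_b = qb[s][a] + alpha * (r + gamma * boot_b - qb[s][a])
    qa[s][a], qb[s][a] = new_a, new_b


if __name__ == "__main__":
    qa = [[0.0, 0.0], [1.0, 2.0]]
    qb = [[0.0, 0.0], [3.0, 0.0]]
    sdq_step(qa, qb, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
    print(qa[0][0], qb[0][0])
```

Computing both new values before writing them keeps the update truly simultaneous: neither table sees the other's freshly updated entry within the same step.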

Paper Structure

This paper contains 35 sections, 19 theorems, 101 equations, 4 figures, and 1 table.

Key Result

Lemma 1

(gosavi2006boundedness) If the step-size is less than one, then for all $k\geq 0$

Figures (4)

  • Figure 1: Left: An example from sutton2018reinforcement. The episode always starts at node $A$. Taking the right action from $A$ yields zero reward and terminates the episode. Taking the left action instead leads to state $B$, where the agent chooses one of 10 available actions. Each of these actions yields a reward sampled from a normal distribution with mean $-0.1$ and standard deviation $1$, after which the episode also terminates. Although $Q^{*}(A,\text{right})$ is zero and $Q^{*}(A,\text{left})$ is $-0.1\gamma$, $Q$-learning favors the left action because of maximization bias. Right: Comparison of experiment results: SDQ vs. double $Q$-learning vs. $Q$-learning vs. $Q$-learning (perturbed, with randomly initialized $Q$-values).
  • Figure 2: Left: 8×8 Grid world example. Middle: Average cumulative reward per step for each algorithm. Right: Evolution over time of the start‐state’s maximum action‐value.
  • Figure 3: Comparison of experiment results: SDQ vs. double $Q$-learning vs. double $Q$-learning (perturbed, with randomly initialized $Q$-values).
  • Figure 4: Overall flow of the proposed analysis
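
The maximization bias described in the Figure 1 caption can be reproduced with a small Monte Carlo sketch. It is not the paper's experiment: the function names, the number of reward samples per action (`n_samples=5`), the trial count, and the seed are all assumptions chosen for illustration. Only the 10 actions and the $\mathcal{N}(-0.1, 1)$ reward distribution come from the caption. The sketch averages, over many trials, the maximum of the per-action sample means at state $B$, which is the quantity a max-based target would bootstrap from.

```python
import random


def max_of_sample_means(rng, n_actions=10, n_samples=5, mean=-0.1, std=1.0):
    """Estimate each action's value from a few noisy rewards, then take the max."""
    sample_means = []
    for _ in range(n_actions):
        draws = [rng.gauss(mean, std) for _ in range(n_samples)]
        sample_means.append(sum(draws) / n_samples)
    return max(sample_means)


def average_max_estimate(trials=2000, seed=0, **kwargs):
    """Average the max-of-sample-means over many independent trials."""
    rng = random.Random(seed)
    total = sum(max_of_sample_means(rng, **kwargs) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    true_value = -0.1  # every action at B has the same true mean reward
    estimate = average_max_estimate()
    print(f"true max value: {true_value:+.3f}")
    print(f"average of max over noisy estimates: {estimate:+.3f}")
```

Every action has the same true value of $-0.1$, yet the maximum over noisy estimates is positive on average, which is why a max-based target makes the left action at $A$ look better than the right one.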

Theorems & Definitions (35)

  • Definition 3.1
  • Lemma 1
  • Remark 4.1: Applicability to complex environments
  • Theorem 4.2
  • Corollary 4.3
  • Proposition 5.1
  • Proposition 5.2
  • Lemma 2: lee2024final
  • proof
  • Lemma 3: lee2024final
  • ...and 25 more