Finite-Time Analysis of Simultaneous Double Q-learning

Hyunjun Na; Donghwan Lee

TL;DR

The paper tackles maximization bias in Q-learning by introducing simultaneous double Q-learning (SDQ), in which two estimators are updated in parallel using cross-referenced greedy actions, removing the need to switch randomly between estimators. It models SDQ as a discrete-time switching system and builds upper and lower comparison systems plus an error system to derive finite-time bounds on the expected error. Experiments on Grid World and OpenAI Gym tasks show SDQ converging faster than standard double Q-learning while still mitigating overestimation bias. The main theoretical result is a finite-time bound of the form $\mathbb{E}[\lVert Q_k^A-Q^*\rVert_\infty] \le \frac{120\alpha^{1/2}|\,\mathcal{S}\times\mathcal{A}\,|}{d_{min}^{9/2}(1-\gamma)^{11/2}}+\frac{48\rho^{k-4}k^{4}|\,\mathcal{S}\times\mathcal{A}\,|^{3/2}}{(1-\gamma)}$, with $\rho=1-\alpha d_{min}(1-\gamma)$, derived under i.i.d. sampling with stochastic coverage. Overall, the work offers a control-theoretic view of double Q-learning, with faster convergence in practice and analytical tools that may extend to function approximation and adaptive settings.
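To get a feel for how the two terms of this bound behave, the following minimal Python sketch evaluates them separately. The function name `sdq_error_bound` and the argument `n_sa` (standing for $|\mathcal{S}\times\mathcal{A}|$) are hypothetical names chosen for illustration, and the sample values in the usage note are arbitrary rather than taken from the paper.

```python
import math


def sdq_error_bound(k, alpha, gamma, d_min, n_sa):
    """Return (constant_term, transient_term) of the TL;DR bound.

    k      : iteration index
    alpha  : constant step size
    gamma  : discount factor
    d_min  : the d_min symbol from the bound (hypothetical argument name)
    n_sa   : |S x A|, the number of state-action pairs (hypothetical name)
    """
    rho = 1.0 - alpha * d_min * (1.0 - gamma)
    # First term: independent of k, scales with sqrt(alpha).
    constant_term = (120.0 * math.sqrt(alpha) * n_sa
                     / (d_min ** 4.5 * (1.0 - gamma) ** 5.5))
    # Second term: rho**(k-4) * k**4 eventually decays geometrically in k.
    transient_term = (48.0 * rho ** (k - 4) * k ** 4 * n_sa ** 1.5
                      / (1.0 - gamma))
    return constant_term, transient_term
```

For example, `sdq_error_bound(10_000, 0.1, 0.9, 0.25, 4)` and `sdq_error_bound(100_000, 0.1, 0.9, 0.25, 4)` return the same first term, while the second term is much smaller at the larger `k`. The first term shrinks only when the step size is reduced, which in turn pushes $\rho$ closer to one and slows the decay of the second term.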

Abstract

$Q$-learning is one of the most fundamental reinforcement learning (RL) algorithms. Despite its widespread success in various applications, it is prone to overestimation bias in the $Q$-learning update. To address this issue, double $Q$-learning employs two independent $Q$-estimators which are randomly selected and updated during the learning process. This paper proposes a modified double $Q$-learning, called simultaneous double $Q$-learning (SDQ), with its finite-time analysis. SDQ eliminates the need for random selection between the two $Q$-estimators, and this modification allows us to analyze double $Q$-learning through the lens of a novel switching system framework facilitating efficient finite-time analysis. Empirical studies demonstrate that SDQ converges faster than double $Q$-learning while retaining the ability to mitigate the maximization bias. Finally, we derive a finite-time expected error bound for SDQ.
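The abstract does not spell out the SDQ update rule, so the sketch below is only one plausible reading of "updating both estimators simultaneously with cross-referenced greedy actions". It assumes a tabular setting in which each estimator picks its own greedy action at the next state and the other estimator evaluates that action. The function name `sdq_update`, its arguments, and the `done` flag are hypothetical, and the paper's exact rule may differ.

```python
import numpy as np


def sdq_update(q_a, q_b, s, a, r, s_next, alpha, gamma, done=False):
    """Update both tabular estimators in place from one transition.

    ASSUMPTION (not confirmed by the source): Q^A's greedy action at s_next
    is evaluated by Q^B, and Q^B's greedy action is evaluated by Q^A.
    q_a, q_b are float arrays of shape (n_states, n_actions).
    """
    a_star = int(np.argmax(q_a[s_next]))  # greedy action under Q^A
    b_star = int(np.argmax(q_b[s_next]))  # greedy action under Q^B
    # Compute both targets from the old tables before writing to either one,
    # so the two updates are truly simultaneous.
    target_a = r + (0.0 if done else gamma * q_b[s_next, a_star])
    target_b = r + (0.0 if done else gamma * q_a[s_next, b_star])
    q_a[s, a] += alpha * (target_a - q_a[s, a])
    q_b[s, a] += alpha * (target_b - q_b[s, a])
```

Unlike standard double $Q$-learning, no coin flip decides which table to update: every transition changes both tables, which is consistent with the abstract's claim that random selection is no longer needed.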

Paper Structure (35 sections, 19 theorems, 101 equations, 4 figures, 1 table)

Introduction
Related works
Preliminaries
Markov decision problem
Switching system
Double Q-learning
Assumption and Definition
Simultaneous double Q-learning (SDQ)
Algorithm
Experiment
Grid World
FrozenLake, CliffWalking, and Taxi Environments
Finite-time error bounds
Comparative convergence analysis
Framework for convergence analysis of SDQ
...and 20 more sections

Key Result

Lemma 1: gosavi2006boundedness

If the step-size is less than one, then for all $k\geq 0$

Figures (4)

Figure 1: Left: An example from sutton2018reinforcement. Every episode starts at node $A$. Taking the right action from $A$ yields zero reward and terminates the episode. Taking the left action leads to state $B$, where the agent chooses one of 10 available actions; each yields a reward sampled from a normal distribution with mean $-0.1$ and standard deviation $1$, after which the episode also terminates. Although $Q^{*}(A,\text{right})$ is zero and $Q^{*}(A,\text{left})$ is $-0.1\gamma$, $Q$-learning favors the left action because of maximization bias. Right: Comparison of experiment results: SDQ vs. double $Q$-learning vs. $Q$-learning vs. $Q$-learning (perturbed, with randomly initialized $Q$-values).
Figure 2: Left: 8×8 Grid World example. Middle: Average cumulative reward per step for each algorithm. Right: Evolution of the start state's maximum action value over time.
Figure 3: Comparison of experiment results: SDQ vs. double $Q$-learning vs. double $Q$-learning (perturbed, with randomly initialized $Q$-values).
Figure 4: Overall flow of the proposed analysis
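The maximization bias described in the Figure 1 caption can be reproduced with a short Monte Carlo sketch of state $B$ alone: 10 actions whose rewards all have mean $-0.1$ and standard deviation $1$. The function name `estimate_bias`, the number of samples per action, and the number of trials are illustrative assumptions, not settings from the paper. The second returned value uses a generic cross-evaluation (select the action with one sample set, score it with an independent one), which illustrates the double-estimator idea rather than SDQ itself.

```python
import numpy as np


def estimate_bias(n_actions=10, n_samples=5, mean=-0.1, std=1.0,
                  n_trials=2000, seed=0):
    """Return (single_estimator_value, cross_evaluated_value) for state B.

    All sizes and the seed are hypothetical defaults chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    shape = (n_trials, n_actions, n_samples)
    # Two independent sets of per-action sample means.
    means_a = rng.normal(mean, std, size=shape).mean(axis=2)
    means_b = rng.normal(mean, std, size=shape).mean(axis=2)
    # Single estimator: take the max of noisy estimates (biased upward).
    single = means_a.max(axis=1).mean()
    # Cross-evaluation: choose the action with set A, score it with set B.
    greedy = means_a.argmax(axis=1)
    double = means_b[np.arange(n_trials), greedy].mean()
    return float(single), float(double)
```

With these defaults the single-estimator value comes out well above the true value of $-0.1$, while the cross-evaluated value stays close to it. This is the effect that makes $Q$-learning prefer the left action at $A$ even though the right action is optimal.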

Theorems & Definitions (35)

Definition 3.1
Lemma 1
Remark 4.1: Applicability to complex environments
Theorem 4.2
Corollary 4.3
Proposition 5.1
Proposition 5.2
Lemma 2: lee2024final
proof
Lemma 3: lee2024final
...and 25 more
