Finite-Time Analysis of Simultaneous Double Q-learning
Hyunjun Na, Donghwan Lee
TL;DR
The paper tackles maximization bias in Q-learning by introducing simultaneous double Q-learning (SDQ), in which two estimators are updated in parallel using cross-referenced greedy actions, removing the need to switch randomly between estimators. It recasts SDQ as a discrete-time switching system and builds upper and lower comparison systems plus an error system to derive finite-time bounds on the expected error. Empirical results on Grid World and OpenAI Gym tasks show that SDQ converges faster than standard double Q-learning while still mitigating overestimation bias. The main theoretical result is a finite-time bound of the form $\mathbb{E}[\lVert Q_k^A-Q^*\rVert_\infty] \le \frac{120\alpha^{1/2}|\,\mathcal{S}\times\mathcal{A}\,|}{d_{min}^{9/2}(1-\gamma)^{11/2}}+\frac{48\rho^{k-4}k^{4}|\,\mathcal{S}\times\mathcal{A}\,|^{3/2}}{(1-\gamma)}$, with $\rho=1-\alpha d_{min}(1-\gamma)$, characterizing finite-time behavior under i.i.d. sampling with stochastic coverage. Overall, the work provides a control-theoretic lens on double Q-learning, yielding practical convergence benefits and new analytical tools for future extensions to function approximation and adaptive settings.
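To get a feel for how the two terms of the bound trade off, the right-hand side can be evaluated numerically. The sketch below transcribes the formula as quoted in this summary; the function name `sdq_error_bound` and the argument `n_sa` (standing for $|\mathcal{S}\times\mathcal{A}|$) are illustrative choices, not names from the paper.

```python
import math


def sdq_error_bound(k, alpha, gamma, d_min, n_sa):
    """Return (constant_term, transient_term) of the quoted bound.

    n_sa is a hypothetical name for |S x A|, the number of state-action pairs.
    """
    rho = 1.0 - alpha * d_min * (1.0 - gamma)
    # Step-size-dependent term: does not shrink with k, only with alpha.
    constant_term = (
        120.0 * math.sqrt(alpha) * n_sa
        / (d_min ** 4.5 * (1.0 - gamma) ** 5.5)
    )
    # Transient term: rho^(k-4) * k^4 eventually decays geometrically in k.
    transient_term = 48.0 * rho ** (k - 4) * k ** 4 * n_sa ** 1.5 / (1.0 - gamma)
    return constant_term, transient_term


if __name__ == "__main__":
    for k in (4, 1_000, 100_000):
        c, t = sdq_error_bound(k, alpha=0.01, gamma=0.5, d_min=0.25, n_sa=4)
        print(f"k={k:>7}  constant={c:.3e}  transient={t:.3e}")
```

Because of the $k^4$ factor, the transient term first grows and only starts to decay once $k$ exceeds roughly $4/(-\ln\rho)$. The constant term scales with $\alpha^{1/2}$, so a smaller step size tightens it at the cost of a value of $\rho$ closer to 1.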
Abstract
$Q$-learning is one of the most fundamental reinforcement learning (RL) algorithms. Despite its widespread success in various applications, its update rule is prone to overestimation bias. To address this issue, double $Q$-learning employs two independent $Q$-estimators that are randomly selected and updated during the learning process. This paper proposes a modified double $Q$-learning algorithm, called simultaneous double $Q$-learning (SDQ), together with its finite-time analysis. SDQ eliminates the need for random selection between the two $Q$-estimators, and this modification allows us to analyze double $Q$-learning through the lens of a novel switching system framework that facilitates efficient finite-time analysis. Empirical studies demonstrate that SDQ converges faster than double $Q$-learning while retaining the ability to mitigate the maximization bias. Finally, we derive a finite-time expected error bound for SDQ.
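The abstract does not spell out the update rule, so the following is only a minimal tabular sketch of one plausible reading of "simultaneous" updates: each estimator picks its own greedy action at the next state, evaluates that action with the other estimator, and both tables are updated on every sample with no random selection. The function name `sdq_update`, the array layout, and the exact cross-referencing are assumptions made for illustration and should be checked against the paper's algorithm.

```python
import numpy as np


def sdq_update(q_a, q_b, s, a, r, s_next, alpha, gamma):
    """Update both Q-tables in place from one transition (s, a, r, s_next).

    Assumed form: each estimator selects its own greedy action and the
    other estimator evaluates it. q_a and q_b have shape (n_states, n_actions).
    """
    # Greedy action of each estimator at the next state.
    a_star_a = int(np.argmax(q_a[s_next]))
    a_star_b = int(np.argmax(q_b[s_next]))
    # Cross-evaluated targets, both computed before either table is written,
    # so the two updates use the same pre-update values.
    target_a = r + gamma * q_b[s_next, a_star_a]
    target_b = r + gamma * q_a[s_next, a_star_b]
    q_a[s, a] += alpha * (target_a - q_a[s, a])
    q_b[s, a] += alpha * (target_b - q_b[s, a])


if __name__ == "__main__":
    q_a = np.zeros((2, 2))
    q_b = np.zeros((2, 2))
    q_a[1] = [1.0, 3.0]
    q_b[1] = [4.0, 2.0]
    sdq_update(q_a, q_b, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
    print(q_a[0, 0], q_b[0, 0])
```

In contrast, standard double $Q$-learning would update only one of the two tables per sample, chosen at random; updating both on every sample is what removes the random selection step described above.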
