Regularized Q-learning through Robust Averaging
Peter Schmitt-Förster, Tobias Sutter
TL;DR
The paper addresses the overestimation bias in asynchronous Q-learning by introducing Regularized Q-learning through Robust Averaging (2RA), a variant with controllable bias built on a distributionally robust estimator $\mathcal{E}_{\rho}$ and on averaging over $N$ Q-function estimates. The per-iteration cost remains comparable to that of standard Q-learning. The authors prove three results: almost-sure convergence to $Q^\star$ as $\rho_n\to 0$ for any $N$; explicit control of the estimation bias through $(\rho,N)$, with unbiasedness recovered as $N\to\infty$; and an asymptotic mean-squared error (AMSE) equal to that of Watkins' Q-learning when the learning rates are scaled as $\alpha_n^{QL}=g/n$ for Watkins' Q-learning and $\alpha_n=Ng/n$ for 2RA. Empirically, 2RA Q-learning performs robustly on Baird's example, random environments, and CartPole, and often outperforms existing Q-learning variants, particularly when the bias that benefits the environment matches the bias selected through $(\rho,N)$. Overall, the method offers a principled, tunable mechanism for balancing bias and variance in Q-learning while preserving convergence guarantees and computational efficiency.
Abstract
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
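The estimation bias mentioned above can be illustrated with a small Monte Carlo sketch. It is not the paper's estimator: the function name `max_estimator_bias`, the Gaussian noise model, and the plain subtraction of `rho` are assumptions made for illustration only, standing in for the closed-form distributionally robust estimator. All true action values are set to zero, so the mean of the estimate is its bias.

```python
import random


def max_estimator_bias(num_actions=10, num_estimates=1, rho=0.0,
                       trials=2000, seed=0):
    """Monte Carlo bias of max_a(average of noisy estimates) - rho.

    Hypothetical setup: every action has true value 0, and each estimate
    is corrupted by independent standard Gaussian noise. Subtracting rho
    is a simplified stand-in for a robust penalty, not the paper's
    closed-form estimator.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Average num_estimates noisy estimates for each action.
        averaged = [
            sum(rng.gauss(0.0, 1.0) for _ in range(num_estimates)) / num_estimates
            for _ in range(num_actions)
        ]
        # The true maximum is 0, so this quantity is the estimation error.
        total += max(averaged) - rho
    return total / trials


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, max_estimator_bias(num_estimates=n))
    print("with penalty:", max_estimator_bias(num_estimates=16, rho=1.0))
```

In this toy setting, the plain maximum overestimates, averaging more estimates shrinks the overestimation, and a sufficiently large penalty flips the sign of the bias. This is the qualitative behavior that the pair $(\rho, N)$ is described as controlling.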
