Regularized Q-learning through Robust Averaging

Peter Schmitt-Förster; Tobias Sutter

TL;DR

The paper addresses the overestimation bias in asynchronous Q-learning by introducing Regularized Q-learning through Robust Averaging (2RA), a variant with controllable bias built on a distributionally robust estimator $\mathcal{E}_{\rho}$ and on averaging over $N$ Q-function estimates. The per-iteration cost remains comparable to standard Q-learning. The authors establish three theoretical results: almost-sure convergence to $Q^\star$ as $\rho_n\to0$ for any $N$; explicit control of the estimation bias via $(\rho,N)$, with unbiasedness recovered as $N\to\infty$; and an asymptotic mean-squared error (AMSE) equal to that of Watkins' Q-learning under the learning-rate scaling $\alpha_n^{QL}=g/n$ and $\alpha_n=Ng/n$. Empirically, 2RA Q-learning performs robustly on Baird's example, random environments, and CartPole, and often outperforms existing Q-learning variants, particularly when the bias direction that suits the environment matches the controlled bias. Overall, the method offers a principled, tunable mechanism for balancing bias and variance in Q-learning while preserving convergence guarantees and practical efficiency.
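The learning-rate scaling quoted above can be made concrete with a short sketch. Only the two formulas $\alpha_n^{QL}=g/n$ and $\alpha_n=Ng/n$ come from the summary; the function names, the numeric values, and the reading that each of the $N$ estimates is updated with probability $1/N$ per step are assumptions made for illustration.

```python
def watkins_rate(n: int, g: float) -> float:
    """alpha_n^{QL} = g / n, the rate quoted for Watkins' Q-learning."""
    return g / n


def two_ra_rate(n: int, g: float, num_estimates: int) -> float:
    """alpha_n = N * g / n, the rate quoted for 2RA Q-learning."""
    return num_estimates * g / n


def expected_rate_per_estimate(n: int, g: float, num_estimates: int) -> float:
    """Expected step size seen by a single estimate.

    ASSUMPTION (not stated in this summary): exactly one of the N
    estimates is chosen uniformly at random and updated at each step.
    """
    return two_ra_rate(n, g, num_estimates) / num_estimates


# Hypothetical values chosen only for the example.
g, N = 0.5, 10
ratios = [two_ra_rate(n, g, N) / watkins_rate(n, g) for n in (1, 10, 100)]
```

Under that assumption the factor $N$ in $\alpha_n$ offsets the fact that each estimate is touched only once every $N$ steps on average, so the effective per-estimate rate matches $g/n$.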

Abstract

We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
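To illustrate the idea of regularizing the maximum term through averaging, here is a minimal tabular sketch. It is not the authors' estimator: the closed form from Lemma 1 is not reproduced on this page, so the penalty used here (mean over the $N$ estimates minus $\sqrt{\rho}$ times their standard deviation) is an assumed stand-in, and the rule of updating one randomly chosen estimate per step is likewise an assumption. All function and variable names are hypothetical.

```python
import numpy as np


def robust_average_target(q_tables: np.ndarray, next_state: int, rho: float) -> float:
    """Penalized max over actions of the averaged Q-estimates.

    q_tables has shape (N, num_states, num_actions).
    ASSUMED form: mean minus sqrt(rho) * std across the N estimates.
    """
    values = q_tables[:, next_state, :]      # shape (N, num_actions)
    mean = values.mean(axis=0)               # average of the N estimates
    spread = values.std(axis=0)              # disagreement between estimates
    return float(np.max(mean - np.sqrt(rho) * spread))


def two_ra_style_update(q_tables, s, a, r, s_next, rho, alpha, gamma, rng):
    """Update one randomly chosen estimate in place (assumed update rule)."""
    i = int(rng.integers(q_tables.shape[0]))
    target = r + gamma * robust_average_target(q_tables, s_next, rho)
    q_tables[i, s, a] += alpha * (target - q_tables[i, s, a])
    return i


# Two estimates, two states, two actions (toy numbers).
q = np.array([[[1.0, 2.0], [0.0, 4.0]],
              [[3.0, 2.0], [2.0, 0.0]]])
plain = robust_average_target(q, next_state=1, rho=0.0)      # max of the mean
penalized = robust_average_target(q, next_state=1, rho=1.0)  # shrunk by spread
```

With $\rho=0$ the target reduces to the maximum of the averaged estimates, and a larger $\rho$ lowers it wherever the estimates disagree, which is the sense in which $\rho$ acts as a bias control knob in this sketch.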

Paper Structure (17 sections, 8 theorems, 87 equations, 5 figures)

Introduction
Related Work.
Contribution.
Problem Setting
Regularization through Robust Averaging
Asymptotic Convergence
Estimation Bias
Asymptotic Mean-Squared Error
Numerical Results
Baird's Example.
Random Environment.
CartPole.
Discussion and Conclusion
Linearization results
Additional Numerical Results
...and 2 more sections

Key Result

Lemma 1

The estimator defined in eq:robust:estimator is equivalently expressed as

Figures (5)

Figure 1: Baird's Example. All methods use an initial learning rate of $\alpha_0=0.01$, $w_{\alpha} = 10^5$, and $\gamma = 0.8$. All 2RA agents additionally use $w_{\rho}=10^{3}$. The reward values are sampled uniformly at random from $[-0.05, 0.05]$. All results are averaged over $100$ consecutive experiments. (a) Baird's example environment with the feature vectors for each state-action pair. (b) Comparison of the AMSE of Watkins' Q-learning, Double Q-learning, Maxmin Q-learning with $N=10$, and 2RA Q-learning with initial $\rho_0=0.5$ and $N=10$. (c) Comparison of the AMSE of 2RA Q-learning with $N=10$ for different initial values $\rho_0$. (d) Mean and standard deviation of the MSE for different values of $N$ with $\rho_0=0.5$.
Figure 2: Random Environment. All methods use an initial learning rate of $\alpha_0=0.01$, $w_{\alpha}=10^{5}$, $\gamma = 0.9$, and all $\theta^{(i)}$ initialized as zero. Maxmin as well as 2RA Q-learning have $N=10$, and 2RA agents additionally use $\rho_{0}=50$ and $w_{\rho}=10^{4}$. The plots show the first six randomly drawn environments, and all results are averaged over $100$ consecutive experiments. A broader plot of the first 20 random environments is provided in Figure 4 in the appendix (Additional Numerical Results).
Figure 3: CartPole, 1000 experiments. All methods use an initial learning rate of $\alpha_{0}=0.4$, $w_{\alpha}=100$, $\gamma=0.999$ and all $\theta^{(i)}$ initialized as zero. Maxmin, as well as 2RA Q-learning, have $N=8$. 2RA further uses $\rho_{0}=150$ and $w_{\rho}=10^{4}$. All algorithms are evaluated after every $50$ episodes, and the hit time is recorded once the average evaluation reward reaches or exceeds $195$. (a) Shows the distributions of each algorithm's hit times and (b) lists the respective mean hit times and corresponding standard deviations.
Figure 4: Random Environment. All methods use an initial learning rate of $\alpha_0=0.01$, $w_{\alpha}=10^{5}$, $\gamma = 0.9$, and all $\theta^{(i)}$ initialized as zero. Maxmin as well as 2RA Q-learning have $N=10$ and 2RA agents additionally use $\rho_{0}=50$ and $w_{\rho}=10^{4}$. The plots show the first 20 randomly drawn environments.
Figure 5: LunarLander, 100 experiments. All methods use a learning rate of $\alpha=0.0002$ and a discount factor of $\gamma=0.99$. Maxmin, as well as 2RA Q-learning, have $N=5$. 2RA further uses $\rho_{0}=25$ and $w_{\rho}=10^{4}$. All algorithms are evaluated every $50$ episodes, and the hit time is recorded once the average evaluation reward reaches or exceeds $200$. (a) Shows the distributions of each algorithm's hit times and (b) lists the respective mean hit times and corresponding standard deviations.
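
The hit-time protocol described in the CartPole and LunarLander captions can be sketched as follows. The captions give only the evaluation interval ($50$ episodes) and the reward thresholds ($195$ and $200$); measuring the hit time in training episodes, the function name, and the sample reward sequences are assumptions for illustration.

```python
from typing import Optional, Sequence


def hit_time(eval_rewards: Sequence[float],
             eval_every: int = 50,
             threshold: float = 195.0) -> Optional[int]:
    """Return the number of training episodes at the first evaluation whose
    average reward reaches or exceeds the threshold, or None if it never does.

    eval_rewards[k] is the average evaluation reward measured after
    (k + 1) * eval_every training episodes (assumed bookkeeping).
    """
    for k, reward in enumerate(eval_rewards, start=1):
        if reward >= threshold:
            return k * eval_every
    return None


# Hypothetical evaluation traces for two runs.
solved_run = hit_time([100.0, 180.0, 195.0, 200.0])
unsolved_run = hit_time([10.0, 20.0])
lunar_run = hit_time([150.0, 210.0], threshold=200.0)
```

Collecting this value over many independent runs would give the kind of hit-time distribution and mean/standard-deviation summary that panels (a) and (b) report.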

Theorems & Definitions (17)

Lemma 1: Estimator computation
proof
Theorem 1: Asymptotic convergence
Lemma 2: ref:Csaba-00
proof: Proof of Theorem 1
Theorem 2: Estimation bias
proof
Lemma 3: ref:Brockwell-91
Corollary 1: Vanishing estimation bias
proof
...and 7 more
