Two-Step Q-Learning

Antony Vijesh; Shreyas S R

TL;DR

This work tackles maximization bias and slow convergence in off-policy Q-learning by introducing two-step Q-learning (TSQL) and its smooth variant (S-TSQL) that avoid importance sampling. The authors prove almost-sure convergence for TSQL to the optimal $Q^*$ and establish convergence of S-TSQL to the fixed point of a smooth Bellman operator $U$, with explicit boundedness results under standard stochastic-approximation assumptions. Empirically, TSQL and S-TSQL reduce bias and improve learning performance across maximization-bias benchmarks, random MDPs, and a roulette/MAB setting, often outperforming classical Q-learning, Double Q-learning, and SORQL. The methods are presented as practical, robust off-policy algorithms with provable guarantees and favorable empirical behavior, offering a flexible framework for bias control through the parameter sequence $\theta_n$.

Abstract

Q-learning is a stochastic approximation version of classic value iteration. The literature has established that Q-learning suffers from both maximization bias and slow convergence. Recently, multi-step algorithms have shown practical advantages over existing methods. This paper proposes novel off-policy two-step Q-learning algorithms that do not require importance sampling. Under suitable assumptions, the iterates of the proposed two-step Q-learning are shown to be bounded and to converge almost surely to the optimal Q-values. This study also addresses the convergence analysis of the smooth version of two-step Q-learning, obtained by replacing the max function with the log-sum-exp function. The proposed algorithms are robust and easy to implement. Finally, we test the proposed algorithms on benchmark problems such as the roulette problem, the maximization bias problem, and randomly generated Markov decision processes, and compare them with existing methods from the literature. Numerical experiments demonstrate the superior performance of both two-step Q-learning and its smooth variant.
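As a rough illustration of the two ingredients the abstract names, the sketch below shows a classical one-step tabular Q-learning update together with a log-sum-exp smooth maximum that can stand in for the hard max. It does not reproduce the two-step update or the role of $\theta_n$, which this summary does not state. The function names, the temperature parameter `beta`, and the list-of-lists Q-table are assumptions made for the example, not taken from the paper.

```python
import math


def smooth_max(values, beta=1.0):
    """Log-sum-exp smooth maximum: (1/beta) * log(sum(exp(beta * v))).

    `beta` is an assumed temperature parameter; larger values bring the
    result closer to max(values).
    """
    m = max(values)  # shift by the max for numerical stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


def q_update(q, s, a, r, s_next, alpha, gamma=0.9, beta=None):
    """One classical tabular Q-learning step (not the two-step rule).

    With beta=None the target uses the hard max over next-state values;
    otherwise it uses the log-sum-exp smooth maximum.
    """
    next_values = q[s_next]
    if beta is None:
        next_estimate = max(next_values)
    else:
        next_estimate = smooth_max(next_values, beta)
    # Move Q(s, a) toward the bootstrapped target by step size alpha.
    q[s][a] += alpha * (r + gamma * next_estimate - q[s][a])
    return q[s][a]
```

For a vector of length $n$, the log-sum-exp value lies between the true maximum and the maximum plus $\log(n)/\beta$, so the smooth operator always overestimates the hard max by a bounded, tunable amount.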

Paper Structure (10 sections, 29 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Literature Review
Preliminaries
Proposed Algorithm
Main Results and Convergence Analysis
Experiments
Maximization Bias Example
Random MDPs
Roulette as Multi-Armed Bandit problem
Conclusion

Figures (7)

Figure 1: Performance of QL, D-Q-Avg, S-TSQL, and TSQL with $\alpha_n=\frac{1}{n+1}$ and $\theta_n=\frac{1}{n^2+10}$.
Figure 2: Performance of TSQL with $\alpha_n=\frac{10}{n+100}$ and various choices of $\theta_n$.
Figure 3: Performance of S-TSQL with $\alpha_n=\frac{10}{n+100}$ and various choices of $\theta_n$.
Figure 4: Performance of TSQL with $\theta_n=\frac{1}{n^2+10}$ and various choices of $\alpha_n$.
Figure 5: Performance of S-TSQL with $\theta_n=\frac{1}{n^2+10}$ and various choices of $\alpha_n$.
...and 2 more figures

Theorems & Definitions (4)

proof
proof
proof
proof
