Linear $Q$-Learning Does Not Diverge in $L^2$: Convergence Rates to a Bounded Set

Xinyu Liu, Zixuan Xie, Shangtong Zhang

TL;DR

This paper proves new nonasymptotic $L^2$ convergence rates for linear and tabular $Q$-learning with unmodified algorithms, under an $ε$-softmax behavior policy with adaptive temperature and minimal assumptions. The core technical contribution is a general stochastic approximation framework that handles time-inhomogeneous Markov noise with fast-changing transitions, enabling explicit rates toward a bounded set for the linear case and convergence to $q_*$ for the tabular case. A key technical device is the pseudo-contraction property of the weighted Bellman operator in the tabular setting and a corresponding Lyapunov function constructed via Moreau envelopes. The results bridge a gap between asymptotic boundedness and finite-sample convergence, with practical impact for understanding the reliability of linear and tabular $Q$-learning in non-ideal, off-policy, Markovian settings.

Abstract

$Q$-learning is one of the most fundamental reinforcement learning algorithms. It was widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) may diverge, until the recent work of Meyn (2024) established the ultimate almost sure boundedness of the iterates of linear $Q$-learning. Building on this success, this paper further establishes the first $L^2$ convergence rate of linear $Q$-learning iterates (to a bounded set). Similar to Meyn (2024), we do not modify the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $ε$-softmax behavior policy with an adaptive temperature. The key to our analysis is a general result on stochastic approximation under Markovian noise with fast-changing transition functions. As a byproduct, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $ε$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
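To make the setting in the abstract concrete, here is a minimal sketch of unmodified linear $Q$-learning driven by an $ε$-softmax behavior policy. Several pieces are assumptions, not taken from this page: the temperature form $\kappa_w = \kappa_0 \max(1, \|w\|_2)$, the step size $\alpha_t = \alpha / (t + t_0)^{\epsilon_\alpha}$, the random toy MDP, and all function names. The paper's own definitions (its eq:temperature and eq:mu_linear) are not reproduced here and may differ.

```python
import numpy as np

def adaptive_temperature(w, kappa0):
    # ASSUMED form: the temperature grows with the weight norm, which keeps
    # the softmax logits x(s, a)^T w / kappa_w bounded.
    return kappa0 * max(1.0, float(np.linalg.norm(w)))

def eps_softmax_policy(features_s, w, eps, kappa0):
    """Action probabilities in one state; features_s has shape (num_actions, d)."""
    logits = features_s @ w / adaptive_temperature(w, kappa0)
    logits = logits - logits.max()  # shift for numerical stability
    soft = np.exp(logits)
    soft /= soft.sum()
    # Mix a uniform distribution (weight eps) with the softmax (weight 1 - eps).
    return eps / features_s.shape[0] + (1.0 - eps) * soft

def step_size(t, alpha, t0, eps_alpha):
    # ASSUMED schedule: polynomially decaying learning rate.
    return alpha / (t + t0) ** eps_alpha

def linear_q_step(w, x_sa, reward, features_next, alpha_t, gamma):
    """One standard semi-gradient linear Q-learning update."""
    td_target = reward + gamma * np.max(features_next @ w)
    td_error = td_target - x_sa @ w
    return w + alpha_t * td_error * x_sa

def simulate(num_steps=1000, num_states=5, num_actions=3, d=4, seed=0,
             eps=0.1, kappa0=1.0, alpha=0.1, t0=10, eps_alpha=0.8, gamma=0.9):
    """Run the update on a hypothetical random MDP; return ||w_t||_2^2 per step."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(num_states, num_actions, d))        # features x(s, a)
    P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    R = rng.uniform(-1.0, 1.0, size=(num_states, num_actions))
    w, s, sq_norms = np.zeros(d), 0, []
    for t in range(num_steps):
        probs = eps_softmax_policy(X[s], w, eps, kappa0)
        a = rng.choice(num_actions, p=probs)
        s_next = rng.choice(num_states, p=P[s, a])
        w = linear_q_step(w, X[s, a], R[s, a], X[s_next],
                          step_size(t, alpha, t0, eps_alpha), gamma)
        sq_norms.append(float(w @ w))
        s = s_next
    return np.array(sq_norms)
```

Calling `simulate()` returns the trajectory of $\|w_t\|_2^2$, the quantity whose boundedness the paper studies. Because the behavior policy depends on the current $w_t$, the state-action chain has a transition function that changes at every step, which is the situation the paper's stochastic approximation result is designed to handle.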

Paper Structure

This paper contains 29 sections, 23 theorems, 145 equations, 2 figures, 2 tables.

Key Result

Theorem 1

Let Assumptions assum:markov and assu lr hold. Then for sufficiently small $\epsilon$ in eq:mu_linear, sufficiently large $\kappa_0$ in eq:temperature, and sufficiently large $t_0$ in $\alpha_t$, there exists some constant $\bar{t}$ such that the iterates $\{w_t\}$ generated by eq:linear q update s (2) When $\epsilon_\alpha \in (0,1)$, there exist some constants $B_{thm:stan,4}$, $B_{thm:stan,5}$

Figures (2)

  • Figure 1: Convergence of eq:linear q update with $\gamma = 0.99, \alpha = 0.1$. The graph shows the evolution of $\|w_t\|_2^2$ over time steps, demonstrating stable convergence behavior. The blue line represents the average of the squared $L^2$ norm of weights over 10 independent runs, and the shaded area indicates the range between minimum and maximum values.
  • Figure 2: Each scenario is independently run 10 times, with the solid lines representing the averages of the squared $L^2$ norm of weights, and the shaded areas indicating the ranges between minimum and maximum values. This comparison illustrates the impact of different modification strategies on the algorithm's convergence behavior.
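The aggregation described in both captions (a mean curve with a min-to-max band across independent runs) can be sketched as follows. The function name and the input layout, one row per run and one column per time step, are assumptions made for illustration; the authors' plotting code is not shown on this page.

```python
import numpy as np

def summarize_runs(squared_norms):
    """Aggregate ||w_t||_2^2 across runs.

    squared_norms: array-like of shape (num_runs, num_steps), an assumed layout.
    Returns the per-step mean plus the min and max that bound the shaded band.
    """
    squared_norms = np.asarray(squared_norms, dtype=float)
    return {
        "mean": squared_norms.mean(axis=0),
        "min": squared_norms.min(axis=0),
        "max": squared_norms.max(axis=0),
    }
```

With matplotlib, the `mean` entry would typically be drawn with `plot` and the band between `min` and `max` with `fill_between`.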

Theorems & Definitions (45)

  • Theorem 1: $L^2$ Convergence Rate of Linear $Q$-Learning
  • Theorem 2: $L^2$ Convergence Rate of Tabular $Q$-Learning
  • Theorem 3
  • proof
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Remark 1
  • proof
  • ...and 35 more