Q-learning as a monotone scheme

Lingyi Yang

TL;DR

This paper investigates the stability and convergence of Q-learning by viewing it as a monotone fixed-point scheme in a simple 1D deterministic LQ setting. It shows that monotone numerical discretisations, such as upwind schemes for solving the Hamilton–Jacobi–Bellman equation, are crucial to obtaining stable value/policy iterations. It also derives concrete conditions under which Q-learning preserves monotonicity, including step-size and feature-boundedness requirements. In the discrete LQ problem, Q-learning is reported to be stable for $0 \le \alpha \le 1$ and to remain stable up to $\alpha = 1.3$ in some runs, but to become unstable at $\alpha = 1.8$; function approximation introduces additional monotonicity constraints. The paper argues that even linear approximators can disrupt monotonicity and that nonlinear approximators may underlie the instability observed in practice, highlighting the need to enforce monotonicity to ensure convergence. These insights connect numerical monotonicity theory with reinforcement-learning stability and offer guidance for discretisation and approximation choices in RL.

Abstract

Stability issues with reinforcement learning methods persist. To better understand some of these stability and convergence issues involving deep reinforcement learning methods, we examine a simple linear quadratic example. We interpret the convergence criterion of exact Q-learning in the sense of a monotone scheme and discuss consequences of function approximation on monotonicity properties.
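To make the monotone-scheme reading concrete, the sketch below is an illustration, not the paper's code. It applies one synchronous, relaxed Q-update to a small discretised 1D deterministic LQ problem and checks whether the update preserves the entrywise ordering of two Q-tables. The grid, dynamics, cost, discount factor and all function names are assumptions chosen for illustration, and the problem is posed as cost minimisation. The update $Q \leftarrow (1-\alpha) Q + \alpha\,(c + \gamma \min_{a'} Q(x', a'))$ has non-negative coefficients when $0 \le \alpha \le 1$, so it is order-preserving there; for $\alpha > 1$ the coefficient $(1-\alpha)$ is negative and the ordering can be reversed.

```python
# Hypothetical discretisation of a 1D deterministic LQ problem.
# All values below are illustrative assumptions, not taken from the paper.
STATES = [-2, -1, 0, 1, 2]   # grid of states x
ACTIONS = [-1, 0, 1]         # grid of controls a
GAMMA = 0.9                  # discount factor


def step(x, a):
    """Deterministic dynamics x' = x + a, clipped to the state grid."""
    return max(STATES[0], min(STATES[-1], x + a))


def cost(x, a):
    """Quadratic running cost x^2 + a^2 (to be minimised)."""
    return x * x + a * a


def relaxed_update(Q, alpha):
    """One synchronous sweep of Q <- (1 - alpha) Q + alpha (c + gamma min_a' Q(x', a'))."""
    new_Q = {}
    for x in STATES:
        for a in ACTIONS:
            x_next = step(x, a)
            target = cost(x, a) + GAMMA * min(Q[(x_next, b)] for b in ACTIONS)
            new_Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target
    return new_Q


def is_order_preserved(alpha, bump=1.0):
    """Check, for ONE test pair Q1 <= Q2, whether update(Q1) <= update(Q2) entrywise."""
    Q1 = {(x, a): 0.0 for x in STATES for a in ACTIONS}
    Q2 = dict(Q1)
    Q2[(0, 1)] += bump  # raise a single entry so that Q1 <= Q2 everywhere
    U1 = relaxed_update(Q1, alpha)
    U2 = relaxed_update(Q2, alpha)
    return all(U1[k] <= U2[k] + 1e-12 for k in U1)


if __name__ == "__main__":
    for alpha in (0.8, 1.0, 1.3, 1.8):
        print(alpha, is_order_preserved(alpha))
```

A single test pair can only exhibit a violation; it does not prove monotonicity, which for $0 \le \alpha \le 1$ follows from the sign argument above. Note also that losing monotonicity is not the same as diverging: the summary reports stable runs at $\alpha = 1.3$, so monotonicity is best read here as a sufficient condition rather than a necessary one.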

Paper Structure (5 sections, 3 theorems, 61 equations, 5 figures, 2 algorithms)

Introduction
1D Deterministic Linear Quadratic Problem
Q-Learning
1D continuous LQ problem
Q-learning (discrete setting)

Key Result

Lemma 1

The dynamic programming principle gives us

Figures (5)

Figure 1: Q-learning: learnt value function and policy (blue) against theoretical (orange) for $\alpha = 0.8$
Figure 2: Q-learning: learnt value function and policy (blue) against theoretical (orange) for $\alpha = 1.8$
Figure 3: An intermediate policy and value function for a downwind method (the policy and value function have not converged yet). Instability forms and becomes amplified with further iterations.
Figure 4: An intermediate policy and value function for an upwind method. Note that whilst the policy has not converged yet, there are no instabilities in this case.
Figure 5: Learnt value function and policy (blue) against theoretical (orange) for $\alpha = 1.3$

Theorems & Definitions (6)

Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
