Q-learning as a monotone scheme
Lingyi Yang
TL;DR
This paper studies the stability and convergence of Q-learning by viewing it as a monotone fixed-point scheme in a simple one-dimensional deterministic linear quadratic (LQ) setting. It argues that monotone numerical discretisations, such as upwind schemes for the Hamilton–Jacobi–Bellman equation, are crucial for stable value and policy iteration, and it derives concrete conditions under which Q-learning preserves monotonicity, including requirements on the step size and on the boundedness of the features. In the discrete LQ problem, Q-learning is reported to be stable for $0 \le \alpha \le 1$, to remain stable in some runs up to $\alpha = 1.3$, and to become unstable at $\alpha = 1.8$; function approximation introduces additional monotonicity constraints. The paper argues that even linear approximators can break monotonicity and that nonlinear approximators may underlie the instability observed in practice, which highlights the need to enforce monotonicity to ensure convergence. These insights connect numerical monotonicity theory with reinforcement-learning stability and offer guidance on discretisation and approximation choices in RL.
Abstract
Stability issues with reinforcement learning methods persist. To better understand some of these stability and convergence issues involving deep reinforcement learning methods, we examine a simple linear quadratic example. We interpret the convergence criterion of exact Q-learning in the sense of a monotone scheme and discuss consequences of function approximation on monotonicity properties.
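To make the monotonicity viewpoint concrete, the following is a minimal sketch rather than the paper's own experiment. It assumes that $\alpha$ denotes the Q-learning step size, and it invents a small discretised LQ problem for illustration: dynamics $x' = x + a$ projected onto a grid, running cost $x^2 + a^2$, and discount $\gamma = 0.9$. The helper names `make_lq`, `q_update`, and `is_monotone_on` are hypothetical. The synchronous update $Q \leftarrow (1-\alpha)Q + \alpha\,(c + \gamma \min_{a'} Q')$ has non-negative coefficients on every entry of $Q$ when $0 \le \alpha \le 1$, so raising $Q$ anywhere cannot lower the output; for $\alpha > 1$ the coefficient $1-\alpha$ is negative and that ordering can fail.

```python
import numpy as np


def make_lq(n_states=21, n_actions=11, x_max=1.0, a_max=0.5):
    """Build an illustrative discretised 1D deterministic LQ problem (assumed setup)."""
    xs = np.linspace(-x_max, x_max, n_states)
    acts = np.linspace(-a_max, a_max, n_actions)
    # Assumed dynamics: x' = x + a, clipped to the domain and snapped to the grid.
    nxt = np.clip(xs[:, None] + acts[None, :], -x_max, x_max)
    next_idx = np.abs(nxt[:, :, None] - xs[None, None, :]).argmin(axis=2)
    # Assumed quadratic running cost c(x, a) = x^2 + a^2.
    cost = xs[:, None] ** 2 + acts[None, :] ** 2
    return next_idx, cost


def q_update(Q, next_idx, cost, alpha, gamma=0.9):
    """One synchronous Q-learning sweep over all (state, action) pairs."""
    # Costs are minimised, so the bootstrap target uses min over next actions.
    target = cost + gamma * Q.min(axis=1)[next_idx]
    return (1.0 - alpha) * Q + alpha * target


def is_monotone_on(Q, bump, next_idx, cost, alpha):
    """Check whether Q <= Q + bump is preserved by one update (bump >= 0)."""
    lo = q_update(Q, next_idx, cost, alpha)
    hi = q_update(Q + bump, next_idx, cost, alpha)
    return bool(np.all(hi >= lo - 1e-12))


if __name__ == "__main__":
    next_idx, cost = make_lq()
    Q = np.zeros_like(cost)
    bump = np.zeros_like(cost)
    bump[0, 0] = 1.0  # raise a single entry of Q
    for alpha in (0.5, 1.0, 1.8):
        print(alpha, is_monotone_on(Q, bump, next_idx, cost, alpha))
```

The check perturbs a single entry that is not the minimiser at its state, so only the $(1-\alpha)$ term reacts to the bump. This sketch tests the order-preservation property only; it does not by itself reproduce the stability thresholds reported in the TL;DR.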
