Towards Formalizing Reinforcement Learning Theory
Shangtong Zhang
TL;DR
This paper addresses the formal verification of almost sure convergence for $Q$-learning and linear TD with Markovian samples by implementing a Lean 4/Mathlib formalization built on the Robbins-Siegmund theorem and Lyapunov-function techniques. It replaces traditional ODE-based stochastic approximation proofs with a skeleton-anchored, measure-theoretic approach that formalizes the probability space of infinite sample paths via Ionescu-Tulcea, enabling rigorous treatment of Markovian noise. The authors provide formal statements for both LinearTD and $Q$-learning under Markovian and IID sampling, prove convergence under step sizes $\alpha_t = (t+2)^{-\nu}$ with $\nu \in (2/3,1)$, and outline a path toward rates and concentration results. This work represents a foundational step toward fully formalizing convergent RL results and offers a framework that can be extended to additional TD methods and more general Markov dynamics, with potential uses in benchmarking LLM reasoning and coding capabilities for formalized RL theory.
Abstract
In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.
