Table of Contents
Fetching ...

Towards Formalizing Reinforcement Learning Theory

Shangtong Zhang

TL;DR

This paper addresses the formal verification of almost sure convergence for $Q$-learning and linear TD with Markovian samples by implementing a Lean 4/Mathlib formalization built on the Robbins-Siegmund theorem and Lyapunov-function techniques. It replaces traditional ODE-based stochastic approximation proofs with a skeleton-anchored, measure-theoretic approach that formalizes the probability space of infinite sample paths via Ionescu-Tulcea, enabling rigorous treatment of Markovian noise. The authors provide formal statements for both LinearTD and $Q$-learning under Markovian and IID sampling, prove convergence under step sizes $\alpha_t = (t+2)^{-\nu}$ with $\nu \in (2/3,1)$, and outline a path toward rates and concentration results. This work represents a foundational step toward fully formalizing convergent RL results and offers a framework that can be extended to additional TD methods and more general Markov dynamics, with potential uses in benchmarking LLM reasoning and coding capabilities for formalized RL theory.

Abstract

In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.

Towards Formalizing Reinforcement Learning Theory

TL;DR

This paper addresses the formal verification of almost sure convergence for -learning and linear TD with Markovian samples by implementing a Lean 4/Mathlib formalization built on the Robbins-Siegmund theorem and Lyapunov-function techniques. It replaces traditional ODE-based stochastic approximation proofs with a skeleton-anchored, measure-theoretic approach that formalizes the probability space of infinite sample paths via Ionescu-Tulcea, enabling rigorous treatment of Markovian noise. The authors provide formal statements for both LinearTD and -learning under Markovian and IID sampling, prove convergence under step sizes with , and outline a path toward rates and concentration results. This work represents a foundational step toward fully formalizing convergent RL results and offers a framework that can be extended to additional TD methods and more general Markov dynamics, with potential uses in benchmarking LLM reasoning and coding capabilities for formalized RL theory.

Abstract

In this paper, we formalize the almost sure convergence of -learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. -learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at https://github.com/ShangtongZhang/rl-theory-in-lean.

Paper Structure

This paper contains 6 sections, 6 theorems, 12 equations.

Key Result

Theorem 3.1

Let the finite Markov chain $\qty{S_t}$ be irreducible and aperiodic. Let $X$ have a full column rank. Let the step size be $\alpha_t = \frac{1}{(t + 2)^\nu}$ with $\nu \in (2/ 3 , 1)$. Then the iterates $\qty{w_t}$ generated by eq linear td with $R_{t+1} \doteq r_\pi(S_t)$ satisfy that $\lim_{t\to\

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 4.4
  • Theorem 4.6: A special case of robbins1971convergence
  • Theorem 4.7