The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise

Shuze Daniel Liu, Shuhang Chen, Shangtong Zhang

TL;DR

The paper extends the Borkar–Meyn stability framework for stochastic approximation from martingale-difference noise to Markovian noise, addressing stability and convergence for RL algorithms with off-policy data and eligibility traces. By establishing a diminishing asymptotic rate of change for a few key functions and applying a subsequence/Arzelà–Ascoli argument, it proves almost-sure boundedness of the iterates and convergence to invariant sets of the limiting ODE under comparatively weak assumptions. The results yield almost-sure convergence guarantees for GTD(λ) and ETD(λ) in off-policy RL, even with unbounded traces, reducing reliance on projections or restrictive drift conditions. The framework broadens applicability to RL with linear function approximation and suggests future work on convergence rates, CLTs, and further relaxations of the assumptions.

Abstract

Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning. One fundamental challenge in analyzing a stochastic approximation algorithm is to establish its stability, i.e., to show that the stochastic vector iterates are bounded almost surely. In this paper, we extend the celebrated Borkar–Meyn theorem for stability from the martingale difference noise setting to the Markovian noise setting, which greatly improves its applicability in reinforcement learning, especially in off-policy reinforcement learning algorithms with linear function approximation and eligibility traces. Central to our analysis is the diminishing asymptotic rate of change of a few functions, which is implied by both a form of the strong law of large numbers and a form of the law of the iterated logarithm.
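To make the setting concrete, here is a minimal toy sketch of a stochastic approximation iteration driven by Markovian noise. It is not an algorithm from the paper: the two-state chain `P`, the per-state `target` values, the update form `x += step * (target[y] - x)`, and the step size `1/n` are all assumptions chosen for illustration. The noise `y` is Markovian because each sample depends on the previous one rather than being drawn independently.

```python
import random


def run_sa(num_steps=20000, seed=0):
    """Toy iteration x_{n+1} = x_n + a_n * H(x_n, Y_{n+1}) with Markovian Y.

    Everything here (chain, targets, step sizes) is a hypothetical example.
    """
    rng = random.Random(seed)
    # Hypothetical two-state Markov chain; P[s][t] = Pr(next = t | current = s).
    # Its stationary distribution is (2/3, 1/3).
    P = [[0.9, 0.1], [0.2, 0.8]]
    # Hypothetical noisy observation attached to each chain state.
    target = [1.0, 4.0]

    x = 0.0        # the iterate being updated
    y = 0          # current state of the noise chain
    max_abs = 0.0  # running record of sup_n |x_n|, the quantity stability bounds
    for n in range(1, num_steps + 1):
        # Markovian noise: the next state depends on the current state.
        y = 0 if rng.random() < P[y][0] else 1
        step = 1.0 / n                   # diminishing step size a_n
        x += step * (target[y] - x)      # H(x, y) = target[y] - x
        max_abs = max(max_abs, abs(x))
    return x, max_abs


if __name__ == "__main__":
    final_x, sup_x = run_sa()
    print(final_x, sup_x)
```

In this toy case, averaging `H` over the stationary distribution gives the limiting ODE dx/dt = 2 - x, so the iterate should settle near 2 while staying bounded throughout. Stability is immediate here because the iterate is a running average of bounded values; the paper's contribution concerns settings where boundedness is not obvious.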

Paper Structure

This paper contains 16 sections, 20 theorems, 105 equations.

Key Result

Theorem 7

Let Assumptions (stationary distribution) through (lim h uniformly convergent) hold, and let either Assumption (LLN) or Assumption (Poisson) hold. Then the iterates $\{x_n\}$ generated by the update rule for $x_n$ are stable, i.e., $\sup_n \lVert x_n \rVert < \infty$ almost surely.

Theorems & Definitions (30)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Theorem 7
  • Corollary 8
  • Lemma 9
  • Lemma 10
  • ...and 20 more