Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

Ethan Blaser, Jiuqi Wang, Shangtong Zhang

TL;DR

This work establishes almost-sure convergence guarantees for differential TD learning in average-reward MDPs without relying on a state-visit-based local clock in the learning rates. By formulating an $n$-step differential TD update and recasting it as a stochastic approximation (SA) problem with an augmented Markov chain, the authors show that the associated ODE at infinity is $\frac{d}{dt}v(t) = -A v(t)$ with $A = D_\mu(I - P_\pi^n + \eta e e^\top)$. The main technical contribution is proving that $A$ is strictly positive stable under three conditions (on-policy, strictly positive $P_\pi^n$, or doubly stochastic $P_\pi^n$) via a rank-one perturbation analysis ($D$-stability). Combined with a recent SA/ODE convergence result, this yields almost-sure convergence of $v_t$ to the solution set $\mathcal{V}_*$ of the differential Bellman equation. The results remove the need for a local clock, suggest avenues for extending the analysis to differential Q-learning, and highlight open questions about $D$-stability and its implications for RL. The work thus strengthens the theoretical foundations of differential TD and aligns convergence analysis more closely with common practice in RL.
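
As a minimal numerical sketch of the stability claim, the following Python builds $A = D_\mu(I - P_\pi^n + \eta e e^\top)$ for a two-state chain and checks positive stability through the trace and determinant (for a real 2x2 matrix, both eigenvalues have positive real part exactly when both quantities are positive). The transition matrix `P`, the weightings `MU_ON` and `MU_OFF`, and all helper names are illustrative assumptions, not taken from the paper; $n = 3$ and $\eta = 0.1$ are values that appear in the figure captions.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    k = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]


def mat_power(P, n):
    """Return P**n for an integer n >= 1."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(result, P)
    return result


def build_A(P, mu, n, eta):
    """A = D_mu (I - P^n + eta * e e^T), with e the all-ones vector."""
    Pn = mat_power(P, n)
    k = len(P)
    return [[mu[i] * ((1.0 if i == j else 0.0) - Pn[i][j] + eta)
             for j in range(k)] for i in range(k)]


def trace_and_det_2x2(A):
    """Trace and determinant of a 2x2 matrix."""
    trace = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return trace, det


def is_positive_stable_2x2(A):
    """True when both eigenvalues of a real 2x2 matrix have positive real part."""
    trace, det = trace_and_det_2x2(A)
    return trace > 0 and det > 0


# Hypothetical two-state chain (an assumption for illustration).
P = [[0.6, 0.4], [0.3, 0.7]]
MU_ON = [3 / 7, 4 / 7]   # stationary distribution of P (on-policy weighting)
MU_OFF = [0.9, 0.1]      # arbitrary off-policy weighting

if __name__ == "__main__":
    for name, mu in [("on-policy", MU_ON), ("off-policy", MU_OFF)]:
        A = build_A(P, mu, n=3, eta=0.1)
        print(name, trace_and_det_2x2(A), is_positive_stable_2x2(A))
    print("eta = 0:", trace_and_det_2x2(build_A(P, MU_ON, n=3, eta=0.0)))
```

With $\eta = 0$ the determinant vanishes up to rounding because $(I - P_\pi^n)e = 0$, which illustrates why the rank-one term $\eta e e^\top$ matters for stability. A two-state example is only a sanity check; it says nothing about the general off-policy case.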

Abstract

The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in the learning rates tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
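
To make the setting concrete, here is a minimal on-policy sketch of $n$-step differential TD in which the learning rate depends only on a single global update counter, not on per-state visit counts. The update form (an $n$-step extension of the standard differential TD rule, with the reward-rate estimate moved by $\eta$ times the value step), the two-state chain `P`, the rewards `R`, and the schedule $\alpha_t = (t+1)^{-0.75}$ are assumptions for illustration; the paper's exact update may differ.

```python
import random

# Hypothetical two-state Markov reward process (policy already folded into P).
P = [[0.6, 0.4], [0.3, 0.7]]   # transition probabilities
R = [1.0, 0.0]                 # reward received on leaving each state


def sample_next(rng, row):
    """Sample an index from a probability row by inverse-CDF lookup."""
    u, acc = rng.random(), 0.0
    for j, p in enumerate(row):
        acc += p
        if u < acc:
            return j
    return len(row) - 1


def differential_td(P, R, n=3, eta=0.1, steps=100_000, seed=0):
    """Assumed n-step differential TD with a global (no local clock) step size."""
    rng = random.Random(seed)
    v = [0.0] * len(P)      # differential value estimates
    r_bar = 0.0             # reward-rate estimate
    state = 0
    states, rewards = [state], []   # sliding window: n + 1 states, n rewards
    updates = 0
    for _ in range(steps):
        rewards.append(R[state])
        state = sample_next(rng, P[state])
        states.append(state)
        if len(rewards) == n:
            # Diminishing step size indexed by the global update count only.
            alpha = 1.0 / (updates + 1) ** 0.75
            # n-step differential TD error for the oldest state in the window.
            delta = sum(rewards) - n * r_bar + v[states[-1]] - v[states[0]]
            v[states[0]] += alpha * delta
            r_bar += eta * alpha * delta
            states.pop(0)
            rewards.pop(0)
            updates += 1
    return v, r_bar


if __name__ == "__main__":
    v, r_bar = differential_td(P, R)
    print("reward-rate estimate:", r_bar)
    print("v(0) - v(1):", v[0] - v[1])
```

For this assumed chain, the average reward is $3/7$ and the differential values satisfy $v(0) - v(1) = 10/7$, so the returned estimates can be compared against those targets. A single seeded run is a sanity check, not evidence of almost sure convergence.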

Paper Structure (19 sections, 14 theorems, 62 equations, 2 figures)

Key Result

Lemma 4.3

(Theorem 2.7 from bierkens2014singular). Let $B \in \mathbb{R}^{n\times n}$ and $v, w \in \mathbb{R}^{n}$. Then $B + vw^\top$ is strictly positive stable if [...] and either of the following conditions holds: [...]

Figures (2)

  • Figure 1: Off-policy convergence of $n$-step differential TD in a $5\times5$ gridworld ($n=3$) for various $\eta$. Although $\eta_0=0$ here, the algorithm is stable across $\eta$. We use a variant of the root mean-squared value error from tsitsiklis1999average, denoted ‘RMSVE (TVR)’, which measures the distance from the estimated values to the nearest solution of the differential Bellman equation. Results are averaged over 30 seeds, with shaded regions showing one standard error. The experimental details and results for other $n$ values appear in Appendix C.
  • Figure 2: Off-policy convergence of $n$-step differential TD in a $5\times5$ gridworld for various $n$ with fixed $\eta = 0.1$. See the experiment description section for the complete setup. Despite $\eta_0 = 0$, differential TD still converges across several $n$ values.
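
One hedged reading of the ‘RMSVE (TVR)’ metric described in the Figure 1 caption: if the solutions of the differential Bellman equation differ only by a constant shift (an assumption here), the nearest solution under a state weighting `d` is obtained by removing the weighted mean of the error. The function name, the weighting `d` (assumed to sum to 1), and the shift-family assumption are illustrative and may not match the paper's exact definition.

```python
import math


def rmsve_tvr(v, v_star, d):
    """Weighted RMS distance from v to the nearest shifted copy of v_star.

    Assumes the solution set is {v_star + c * e : c real} and that the
    weights d are nonnegative and sum to 1.
    """
    err = [a - b for a, b in zip(v, v_star)]
    # The shift c minimizing sum_s d(s) * (err(s) - c)^2 is the weighted mean.
    shift = sum(w * e for w, e in zip(d, err))
    return math.sqrt(sum(w * (e - shift) ** 2 for w, e in zip(d, err)))


if __name__ == "__main__":
    d = [0.5, 0.5]
    print(rmsve_tvr([5.0, 6.0], [0.0, 1.0], d))  # pure shift of a solution
    print(rmsve_tvr([1.0, 0.0], [0.0, 0.0], d))  # error that is not a shift
```

Under this reading, an estimate that differs from a solution only by a constant receives zero error, which matches the idea of measuring distance to a solution set rather than to a single value function.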

Theorems & Definitions (23)

  • Definition 2.1
  • Lemma 4.3
  • Lemma 4.4
  • proof
  • Theorem 4.6
  • proof
  • Corollary 4.7
  • proof
  • Lemma 4.8: Lemma 2.11 from bierkens2014singular
  • Lemma 4.9
  • ...and 13 more