Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Ethan Blaser, Jiuqi Wang, Shangtong Zhang
TL;DR
This work establishes almost-sure convergence guarantees for differential TD learning in average-reward MDPs without relying on a state-visit-based local clock in the learning rates. By formulating an $n$-step differential TD update and recasting it as a stochastic approximation (SA) problem with an augmented Markov chain, the authors show that the associated ODE at infinity is $\frac{d}{dt}v(t) = -A v(t)$ with $A = D_\mu(I - P_\pi^n + \eta e e^\top)$. The main technical contribution is proving that $A$ is strictly positive stable under any of three conditions (on-policy, strictly positive $P_\pi^n$, or doubly stochastic $P_\pi^n$) via a rank-one perturbation analysis tied to $D$-stability. Combined with a recent SA/ODE convergence result, this yields almost-sure convergence of $v_t$ to the solution set $\mathcal{V}_*$ of the differential Bellman equation. The results remove the need for a local clock, suggest avenues for extending the analysis to differential Q-learning, and highlight open questions about $D$-stability and its implications for RL. The work thus strengthens the theoretical foundations of differential TD and aligns convergence analysis more closely with common practice in RL.
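The positive-stability claim can be checked numerically on a toy chain. The sketch below is illustrative only and is not taken from the paper: it assumes $D_\mu$ is the diagonal matrix of the stationary distribution of $P_\pi$ (the on-policy case), that $e$ is the all-ones vector, and that $\eta$ is a positive scalar. The two-state transition matrix and the helper names `stationary_distribution` and `limiting_matrix` are hypothetical.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution mu of an ergodic transition matrix P (mu^T P = mu^T)."""
    k = P.shape[0]
    # Stack the balance equations (P^T - I) mu = 0 with the normalization sum(mu) = 1.
    lhs = np.vstack([P.T - np.eye(k), np.ones((1, k))])
    rhs = np.concatenate([np.zeros(k), [1.0]])
    mu, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return mu


def limiting_matrix(P, n, eta):
    """Build A = D_mu (I - P^n + eta * e e^T), assuming mu is stationary for P."""
    k = P.shape[0]
    mu = stationary_distribution(P)
    e = np.ones((k, 1))  # assumed: e is the all-ones column vector
    P_n = np.linalg.matrix_power(P, n)
    return np.diag(mu) @ (np.eye(k) - P_n + eta * (e @ e.T))


# Hypothetical two-state chain, used only for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

for n in (1, 3):
    eigenvalues = np.linalg.eigvals(limiting_matrix(P, n, eta=1.0))
    # Positive stability means every eigenvalue has a strictly positive real part.
    print(n, np.sort(eigenvalues.real), bool(np.all(eigenvalues.real > 0)))
```

For this chain with $n = 1$ and $\eta = 1$, the eigenvalues of $A$ are $2/15$ and $1$, so the ODE $\frac{d}{dt}v(t) = -A v(t)$ is stable at the origin. A single numerical check of this kind only illustrates the property; it does not replace the paper's proof.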
Abstract
The average reward is a fundamental performance metric in reinforcement learning (RL) that focuses on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL, as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in the learning rates, tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
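To make the distinction concrete, the sketch below contrasts a local-clock learning rate, indexed by how often the current state has been visited, with a standard learning rate indexed only by the global time step. It is a minimal illustration, not the paper's algorithm: the schedule $\alpha_k = 1/k$, the short state trajectory, and the helper name `step_sizes` are assumptions chosen for readability.

```python
from collections import defaultdict


def step_sizes(states, schedule):
    """Return (local_clock, global_clock) learning rates along a state trajectory."""
    visits = defaultdict(int)
    local_clock, global_clock = [], []
    for t, s in enumerate(states):
        visits[s] += 1
        # Local clock: the rate depends on the number of visits to state s so far.
        local_clock.append(schedule(visits[s]))
        # Standard schedule: the rate depends only on the global step count.
        global_clock.append(schedule(t + 1))
    return local_clock, global_clock


# Hypothetical trajectory over two states and an assumed 1/k schedule.
states = [0, 0, 1, 0, 1]
local_clock, global_clock = step_sizes(states, lambda k: 1.0 / k)
print(local_clock)
print(global_clock)
```

The local-clock variant needs a per-state visit counter, which is only well defined in the tabular setting. The global schedule needs nothing beyond the step index, which is why it carries over to function approximation.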
