Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation
Han-Dong Lim, Donghwan Lee
TL;DR
The paper addresses the instability of off-policy TD learning with linear function approximation in the presence of bootstrapping (the deadly triad) by analyzing both model-based deterministic algorithms and their model-free reinforcement learning counterparts. It shows that the projected $n$-step Bellman operator becomes contractive for sufficiently large $n$, which enables convergence of $n$-step value iteration. It then develops two model-free off-policy $n$-step TD algorithms and proves their almost-sure convergence under i.i.d. and Markov observation models via an ODE-based stochastic approximation framework. The work derives explicit finite-$n$ conditions (e.g., Schur/Hurwitz criteria) and provides bounds that scale only logarithmically with problem factors, connecting contraction properties to the stability of the learning dynamics. These results offer principled guidance for choosing $n$ in practice to stabilize off-policy multi-step TD learning with linear function approximation, and they deepen the theoretical understanding of how multi-step updates can overcome divergence in the deadly triad. Overall, the contributions bridge deterministic and stochastic analyses to enable robust off-policy reinforcement learning in settings where bootstrapping and function approximation interact unfavorably.
Abstract
This paper analyzes multi-step temporal difference (TD)-learning algorithms within the ``deadly triad'' scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step TD-learning algorithms converge to a solution when the sampling horizon $n$ is sufficiently large. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration and gradient descent algorithms. These can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when $n$ is sufficiently large. Based on these findings, in the second part, we propose and analyze two $n$-step TD-learning algorithms, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.
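The role of a large $n$ can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm: it assumes a small randomly generated Markov chain `P` (standing in for the target-policy transition matrix), an arbitrary positive state weighting `d` (standing in for an off-policy sampling distribution), and a random feature matrix `Phi`; all names and sizes are hypothetical. It computes the $\infty$-norm of the linear part of the projected $n$-step Bellman operator, $\Pi \gamma^n P^n$ with $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$, for increasing $n$. Because $\|\gamma^n P^n\|_\infty = \gamma^n$ and $\|\Pi\|_\infty \le 1/\sqrt{d_{\min}}$, this norm must eventually drop below one, at which point the operator is a contraction in the $\infty$-norm.

```python
import numpy as np


def projected_nstep_norm(Phi, D, P, gamma, n):
    """Infinity-norm of Pi @ (gamma^n P^n), the linear part of the
    projected n-step Bellman operator (illustrative helper, not from the paper)."""
    # D-weighted projection onto the column space of Phi.
    Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
    M = Pi @ (gamma ** n * np.linalg.matrix_power(P, n))
    return np.linalg.norm(M, ord=np.inf)  # maximum absolute row sum


# Hypothetical toy problem: all sizes and distributions are assumptions.
rng = np.random.default_rng(0)
num_states, num_features, gamma = 6, 3, 0.9

# Random row-stochastic transition matrix.
P = rng.random((num_states, num_states))
P /= P.sum(axis=1, keepdims=True)

# Arbitrary state weighting, bounded away from zero; it is NOT the
# stationary distribution of P, which mimics the off-policy setting.
d = rng.random(num_states) + 0.1
d /= d.sum()
D = np.diag(d)

# Random features (full column rank with probability one).
Phi = rng.standard_normal((num_states, num_features))

horizons = range(1, 41)
norms = [projected_nstep_norm(Phi, D, P, gamma, n) for n in horizons]

# Smallest horizon at which the infinity-norm falls below one.
first_contractive = next(n for n, v in zip(horizons, norms) if v < 1.0)
print("first n with norm < 1:", first_contractive)
```

The $\infty$-norm is only one sufficient test for contraction; the Schur and Hurwitz criteria mentioned in the summary above are conditions on related matrices and may be satisfied at smaller $n$ than this check suggests.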
