Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation

Han-Dong Lim, Donghwan Lee

TL;DR

The paper addresses the instability of off-policy TD learning with linear function approximation and bootstrapping (the deadly triad) by analyzing model-based deterministic algorithms alongside their model-free reinforcement learning counterparts. It shows that the projected $n$-step Bellman operator becomes a contraction for sufficiently large $n$, which guarantees convergence of $n$-step projected value iteration; it then develops two model-free off-policy $n$-step TD algorithms and proves their almost-sure convergence under i.i.d. and Markov observation models via an ODE-based stochastic approximation framework. The work derives explicit finite-$n$ conditions (e.g., Schur/Hurwitz criteria) and provides bounds that scale only logarithmically with problem factors, connecting contraction properties to the stability of the learning dynamics. These results offer principled guidance for choosing $n$ in practice and clarify how multi-step updates can overcome divergence in the deadly triad.
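The contraction claim can be illustrated numerically. The sketch below is not taken from the paper: it assumes the weighted projection $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$ and measures contraction in the $\infty$-norm, and all function names are hypothetical. Because the target-policy transition matrix $P$ is row-stochastic, $\|\Pi \gamma^n P^n\|_\infty \le \gamma^n \|\Pi\|_\infty$, so $\gamma^n \|\Pi\|_\infty < 1$ is a sufficient (not necessary) condition, and it holds once $n > \log\|\Pi\|_\infty / \log(1/\gamma)$. This is one way a logarithmic dependence on problem factors can arise; the paper's exact conditions may differ.

```python
import numpy as np


def projection_matrix(Phi, d):
    """Weighted projection onto span(Phi): Phi (Phi^T D Phi)^{-1} Phi^T D."""
    D = np.diag(d)
    return Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)


def inf_norm(M):
    """Induced infinity norm: maximum absolute row sum."""
    return np.abs(M).sum(axis=1).max()


def contraction_modulus(Phi, d, P, gamma, n):
    """Infinity-norm modulus of V -> Pi (gamma^n P^n V + const)."""
    Pi = projection_matrix(Phi, d)
    return inf_norm(Pi @ (gamma**n * np.linalg.matrix_power(P, n)))


def smallest_sufficient_n(Phi, d, gamma):
    """Smallest n >= 1 with gamma^n * ||Pi||_inf < 1 (sufficient condition only)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    pi_norm = inf_norm(projection_matrix(Phi, d))
    n = 1
    while gamma**n * pi_norm >= 1.0:
        n += 1
    return n


# Toy two-state example (made-up numbers, purely illustrative).
Phi = np.array([[1.0], [2.0]])   # one feature per state
d = np.array([0.5, 0.5])         # state weighting used by the projection
P = np.full((2, 2), 0.5)         # target-policy transition matrix
n_min = smallest_sufficient_n(Phi, d, gamma=0.9)
print(n_min, contraction_modulus(Phi, d, P, 0.9, n_min))
```

In this toy problem the one-step projected operator is not an $\infty$-norm contraction, while the operator at the returned horizon is.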

Abstract

This paper analyzes multi-step temporal difference (TD)-learning algorithms within the "deadly triad" scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step TD-learning algorithms converge to a solution when the sampling horizon $n$ is sufficiently large. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration and gradient descent algorithms. These can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when $n$ is sufficiently large. Based on these findings, in the second part, two $n$-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.
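For orientation, a generic off-policy $n$-step TD update with linear function approximation can be sketched as follows. This is the common textbook semi-gradient form with a product of importance sampling ratios, written under assumed names (`n_step_off_policy_td_update`, `feats`, `rhos`); the two algorithms proposed in the paper are not reproduced in this summary and may differ in their details.

```python
import numpy as np


def n_step_off_policy_td_update(theta, feats, rewards, rhos, gamma, alpha):
    """One semi-gradient n-step TD update from a length-n behavior-policy segment.

    theta:   array (dim,), current weights, so V(s) is approximated by feats(s) @ theta
    feats:   array (n + 1, dim), features of s_t, ..., s_{t+n}
    rewards: array (n,), rewards r_t, ..., r_{t+n-1}
    rhos:    array (n,), ratios pi(a_k | s_k) / beta(a_k | s_k) for k = t, ..., t+n-1
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    # n-step return, bootstrapped from the current estimate at s_{t+n}
    target = discounts @ rewards + gamma**n * (feats[n] @ theta)
    td_error = target - feats[0] @ theta
    # The product of ratios corrects for the behavior/target policy mismatch
    return theta + alpha * np.prod(rhos) * td_error * feats[0]


# Hypothetical two-step segment with two-dimensional features.
theta = np.zeros(2)
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rewards = np.array([1.0, 1.0])
rhos = np.array([1.0, 2.0])
theta = n_step_off_policy_td_update(theta, feats, rewards, rhos, gamma=0.5, alpha=0.1)
```

The update touches all three elements of the deadly triad at once: the linear value estimate, the off-policy correction through `rhos`, and the bootstrapped term `gamma**n * (feats[n] @ theta)`, whose weight shrinks geometrically as $n$ grows.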

Paper Structure

This paper contains 15 sections, 16 theorems, 31 equations, 1 figure, 2 algorithms.

Key Result

Lemma 1

Under Assumption 2, the following statements hold true:

Figures (1)

  • Figure D.1: Due to high variance, we clipped the importance sampling ratio. Instead of $\rho$, we used $\min\{\rho,9\}$.
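
The clipping described in the caption can be sketched as below. The threshold 9 comes from the caption; the function name and the choice to clip each per-step ratio are assumptions, since the caption does not say whether clipping is applied per step or to the $n$-step product.

```python
import numpy as np


def clipped_ratio(pi_prob, beta_prob, clip=9.0):
    """Return min(rho, clip), where rho = pi_prob / beta_prob."""
    rho = np.asarray(pi_prob, dtype=float) / np.asarray(beta_prob, dtype=float)
    return np.minimum(rho, clip)


# A raw ratio of 18 is clipped to 9; a raw ratio of 2 is left unchanged.
ratios = clipped_ratio([0.9, 0.5], [0.05, 0.25])
```

Clipping trades a bounded variance for some bias in the off-policy correction.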

Theorems & Definitions (28)

  • Definition 1: Policy evaluation problem
  • Lemma 1: Lemma 3 in lee2023new
  • Lemma 2
  • Lemma 3
  • Definition 2
  • Theorem 1
  • Remark 1
  • Lemma 4: Corollary 5.6.16 in horn2012matrix
  • Theorem 2
  • Remark 2
  • ...and 18 more