Analysis of Off-Policy Multi-Step TD-Learning with Linear Function Approximation

Donghwan Lee

TL;DR

The paper tackles instability in off-policy TD learning with linear function approximation by analyzing $n$-step methods. It first develops model-based deterministic analogues ($n$-step projected value iteration, gradient-based formulations, and a system-operator framework) and proves contraction or strong convexity when the horizon $n$ is large enough, yielding a well-defined fixed point $\theta_*^n$. Building on these results, it introduces two model-free stochastic algorithms, $n$-TD and $n$-GTD, and shows they converge to $\theta_*^n$ for sufficiently large $n$ under standard step-size assumptions. Collectively, the work provides convergence guarantees for $n$-step TD methods under the deadly triad, bridging deterministic analyses with stochastic RL and guiding practical selection of $n$ to ensure stable learning.

Abstract

This paper analyzes multi-step TD-learning algorithms within the "deadly triad" scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step TD-learning algorithms converge to a solution once the sampling horizon $n$ is sufficiently large. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration, gradient descent algorithms, and the control-theoretic approach. These can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. We prove that these deterministic algorithms converge to meaningful solutions when $n$ is sufficiently large. Based on these findings, in the second part, two $n$-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the gradient and control-theoretic algorithms.
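The role of the horizon $n$ in the deterministic setting can be illustrated with a small numerical sketch. The code below is not taken from the paper: it assumes the standard form of $n$-step projected value iteration, $\theta_{k+1} = (\Phi^\top D \Phi)^{-1} \Phi^\top D \,(r_n + \gamma^n P^n \Phi \theta_k)$, and uses a hypothetical two-state example (features $\Phi$, target-policy transition matrix `P`, uniform behavior weighting `D`, and discount `gamma` are all illustrative choices). It computes the spectral radius of the iteration matrix for several $n$; the iteration can only converge when that radius is below 1.

```python
import numpy as np

def nstep_iteration_matrix(Phi, D, P, gamma, n):
    """Linear part A_n of the assumed n-step projected value iteration
    theta_{k+1} = A_n theta_k + b_n, with
    A_n = (Phi^T D Phi)^{-1} Phi^T D (gamma^n P^n) Phi."""
    gram = Phi.T @ D @ Phi                      # weighted feature Gram matrix
    Pn = np.linalg.matrix_power(P, n)           # n-step transition matrix
    return np.linalg.solve(gram, Phi.T @ D @ (gamma**n * Pn) @ Phi)

def spectral_radius(A):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

# Hypothetical two-state example (not from the paper):
# one feature with values 1 and 2, both states move to state 2 under the
# target policy, and the behavior distribution weights the states equally.
Phi = np.array([[1.0], [2.0]])
P = np.array([[0.0, 1.0], [0.0, 1.0]])
D = np.diag([0.5, 0.5])
gamma = 0.9

radii = {n: spectral_radius(nstep_iteration_matrix(Phi, D, P, gamma, n))
         for n in range(1, 6)}

for n, rho in radii.items():
    status = "contracts" if rho < 1.0 else "may diverge"
    print(f"n={n}: spectral radius = {rho:.4f} ({status})")
# In this example A_n = 1.2 * gamma^n, so n=1 gives about 1.08 and
# n=2 gives about 0.972.
```

In this toy instance the one-step iteration is unstable while every horizon $n \ge 2$ yields a contraction, which is consistent with the "sufficiently large $n$" condition described above. The threshold on $n$ depends on the features, the weighting, and the discount factor, so the specific value here should not be read as general.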

Paper Structure (12 sections, 24 theorems, 59 equations, 2 algorithms)

Introduction
Preliminaries
Notation
Markov decision process
Review of GTD algorithm
Multi-step projected Bellman operator
Gradient operator I
Gradient operator II
System operator
Off-policy multi-step TD-learning based on the system operator
Off-policy $n$-step TD-learning based on the gradient operator
Conclusion

Key Result

Lemma 1

Suppose that Assumption 2 holds. Then, the following statements hold true:

Theorems & Definitions (44)

Lemma 1 [lee2023new]
Lemma 2
proof
Lemma 3
Theorem 1
proof
Theorem 2
proof
Lemma 4
proof
...and 34 more
