Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Gandharv Patil; Prashanth L. A.; Dheeraj Nagaraj; Doina Precup

TL;DR

This work provides a finite-time analysis of tail-averaged temporal-difference (TD) learning with linear function approximation, achieving the optimal $O\left(1/t\right)$ convergence rate with a universal step size that does not require knowledge of the eigenvalues of the matrix underlying the projected TD fixed point. It introduces tail-averaged TD and tail-averaged TD with regularisation, proves bounds in expectation and with high probability, and shows that averaging only the final iterates yields exponential forgetting of the initial error while keeping the variance term at $O\left(1/t\right)$. The regularised variant converges to the regularised fixed point $(A+\lambda I)^{-1} b$ and can offer improved bounds in ill-conditioned settings; its distance to the vanilla TD fixed point scales as $O(\lambda)$. The analysis also extends to Markov sampling via mixing arguments, so the results are not limited to i.i.d. data, and it contrasts favourably with prior work that required eigenvalue information or projection constraints. Overall, the results provide practically tunable, interpretable finite-time guarantees for TD with linear function approximation under tail-averaging and regularisation.
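For orientation, the objects named in the summary can be written out as follows. This is a sketch in standard TD(0) notation, not a quotation from the paper: the symbols $\phi$, $r_t$, $s_t$, $s'_t$ and the definitions of $A$ and $b$ are assumptions based on the usual linear TD setup, while $\gamma$ (step size), $\beta$ (discount factor), $\lambda$ (regularisation parameter) and $\theta_{k+1,N}$ (tail-averaged iterate, with $N = t - k$) follow the usage on this page.

```latex
\begin{align*}
  % TD(0) update with linear function approximation
  \theta_{t+1} &= \theta_t + \gamma \left( r_t + \beta\, \phi(s'_t)^\top \theta_t
                  - \phi(s_t)^\top \theta_t \right) \phi(s_t), \\
  % Tail average: discard the first k iterates and average the remaining N = t - k
  \theta_{k+1,N} &= \frac{1}{N} \sum_{i=k+1}^{k+N} \theta_i, \qquad N = t - k, \\
  % Regularised update: the same TD step plus a shrinkage term -\gamma \lambda \theta_t
  \theta_{t+1} &= (1 - \gamma\lambda)\, \theta_t + \gamma \left( r_t + \beta\, \phi(s'_t)^\top \theta_t
                  - \phi(s_t)^\top \theta_t \right) \phi(s_t), \\
  % Fixed points, with A = E[\phi(s)(\phi(s) - \beta \phi(s'))^\top] and b = E[r\, \phi(s)]
  \theta^\star &= A^{-1} b, \qquad
  \theta^\star_{\mathrm{reg}} = (A + \lambda I)^{-1} b.
\end{align*}
```

Taking expectations in the regularised update gives $\theta + \gamma\left(b - (A + \lambda I)\theta\right)$, which is stationary exactly at $(A+\lambda I)^{-1} b$; setting $\lambda = 0$ recovers the vanilla TD fixed point.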

Abstract

We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From our analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
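The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm: the function name `tail_averaged_td`, its argument names, and the toy single-state problem are hypothetical, and the step size `gamma = 0.2` is an arbitrary small constant chosen for the demo rather than the bound from the theorem. Setting `lam > 0` adds the regularisation term.

```python
import numpy as np


def tail_averaged_td(phi, phi_next, rewards, beta, gamma, k, lam=0.0, theta0=None):
    """TD(0) with linear features; returns the average of iterates k+1, ..., t.

    phi, phi_next : arrays of shape (t, d), features of s_i and of the next state s'_i
    rewards       : array of shape (t,)
    beta          : discount factor
    gamma         : constant step size
    k             : number of initial iterates to discard (tail length N = t - k)
    lam           : regularisation parameter (0.0 gives plain TD)
    """
    t, d = phi.shape
    if not 0 <= k < t:
        raise ValueError("k must satisfy 0 <= k < t")
    theta = np.zeros(d) if theta0 is None else np.asarray(theta0, dtype=float).copy()
    tail_sum = np.zeros(d)
    for i in range(t):
        # Temporal-difference error for the i-th sampled transition.
        td_error = rewards[i] + beta * (phi_next[i] @ theta) - phi[i] @ theta
        # TD step, with an extra shrinkage term when lam > 0.
        theta = theta + gamma * (td_error * phi[i] - lam * theta)
        # After this step theta is iterate i + 1; keep it if it lies in the tail.
        if i >= k:
            tail_sum += theta
    return tail_sum / (t - k)


# Toy check (hypothetical): one state, feature 1, reward 1, beta = 0.5.
# Here A = 1 - beta = 0.5 and b = 1, so the TD fixed point is b / A = 2
# and the regularised fixed point with lam = 0.5 is b / (A + lam) = 1.
t = 400
ones = np.ones((t, 1))
theta_plain = tail_averaged_td(ones, ones, np.ones(t), beta=0.5, gamma=0.2, k=200)
theta_reg = tail_averaged_td(ones, ones, np.ones(t), beta=0.5, gamma=0.2, k=200, lam=0.5)
```

Because the toy problem is deterministic, there is no variance term and the tail average only has to forget the initial error, which decays geometrically over the discarded first `k` iterates. With sampled transitions, the same function averages out the noise in the retained iterates.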

Paper Structure (37 sections, 32 theorems, 142 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Related work.
TD with linear function approximation
Value function approximation
Temporal Difference (TD) Learning
Tail-averaged TD
Basic algorithm
Finite time bounds
Regularized TD Learning
Basic algorithm
Finite time bounds
Proof Ideas
Proof of Theorem \ref{thm:expectation_bound} (Sketch)
Proof of Theorem \ref{thm:high-prob-bound} (Sketch)
Proof of Theorem \ref{thm:regtd-expectation_bound} (Sketch)
...and 22 more sections

Key Result

Theorem 1

Suppose Assumptions asm:stationary through asm:phiFullRank hold. Choose a step size $\gamma$ satisfying […], where $\beta$ is the discount factor and $\Phi_{\mathsf{max}}$ is a bound on the features (see Assumption asm:bddFeatures). Then the expected error of the tail-averaged iterate $\mathbf{\theta}_{k+1,N}$ when using Algorithm alg:ciac-a satisfies […], where $N = t - k$, $\theta_0$ is the initial point, $\sigma^2 = (R_{\m…

Figures (1)

Figure 1: A two-state Markov chain

Theorems & Definitions (74)

Theorem 1: Bound in expectation
proof
Remark 1
Remark 2
Remark 3
Remark 4
Remark 5
Remark 6
Theorem 2: High-probability bound
proof
...and 64 more
