Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Gandharv Patil, Prashanth L. A., Dheeraj Nagaraj, Doina Precup

TL;DR

This work provides a finite-time analysis of tail-averaged temporal-difference (TD) learning with linear function approximation, achieving the optimal $O\left(1/t\right)$ convergence rate with a universal step size that does not require knowledge of eigenvalues. It introduces tail-averaged TD and tail-averaged TD with regularisation, proves bounds both in expectation and with high probability, and shows that averaging only the final iterates yields exponential forgetting of the initial error while keeping the variance at $O\left(1/t\right)$. The regularised variant targets the regularised fixed point $(A+\lambda I)^{-1} b$ and can offer improved bounds in ill-conditioned settings, with the distance to the vanilla TD fixed point bounded by $O(\lambda)$. The analysis also extends to Markov sampling via mixing arguments, indicating that the results hold beyond i.i.d. data, and contrasts favourably with prior work that required eigenvalue information or projection constraints. Overall, the results provide practically tunable, interpretable finite-time guarantees for TD with linear function approximation under tail-averaging and regularisation.
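
To make the tail-averaging idea concrete, the following is a minimal Python sketch of TD(0) with linear function approximation in which only the iterates $\theta_{k+1}, \dots, \theta_t$ are averaged. It is not the paper's algorithm listing: the two-state chain, rewards, one-hot features, step size `gamma`, horizon `t`, and burn-in index `k` are all illustrative assumptions, and the paper's step-size condition is not reproduced here. The optional `lam` argument shrinks the iterate by `(1 - gamma * lam)` at every step, which is one common way to write a regularised TD update and is likewise an assumption.

```python
import random


def tail_averaged_td(P, r, phi, beta, gamma, t, k, lam=0.0, seed=0):
    """Run TD(0) for t steps along one trajectory and return the average
    of the iterates theta_{k+1}, ..., theta_t (N = t - k of them)."""
    rng = random.Random(seed)
    states = range(len(P))
    d = len(phi[0])
    theta = [0.0] * d          # initial point theta_0 (assumed to be zero)
    tail_sum = [0.0] * d
    s = 0
    for i in range(1, t + 1):
        # Sample the next state from row s of the transition matrix.
        s_next = rng.choices(states, weights=P[s])[0]
        v = sum(f * th for f, th in zip(phi[s], theta))
        v_next = sum(f * th for f, th in zip(phi[s_next], theta))
        delta = r[s] + beta * v_next - v          # TD error
        # TD step; lam > 0 adds the (assumed) regularisation shrinkage.
        theta = [(1.0 - gamma * lam) * th + gamma * delta * f
                 for th, f in zip(theta, phi[s])]
        if i > k:                                  # keep only the tail
            tail_sum = [a + th for a, th in zip(tail_sum, theta)]
        s = s_next
    n = t - k
    return [a / n for a in tail_sum]


# Hypothetical two-state chain with one-hot features (all values assumed).
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
phi = [[1.0, 0.0], [0.0, 1.0]]
beta = 0.9

estimate = tail_averaged_td(P, r, phi, beta, gamma=0.02, t=200_000, k=100_000)

# Exact values V = (I - beta P)^{-1} r for this chain, via Cramer's rule.
m00, m01 = 1 - beta * P[0][0], -beta * P[0][1]
m10, m11 = -beta * P[1][0], 1 - beta * P[1][1]
det = m00 * m11 - m01 * m10
v_true = [(m11 * r[0] - m01 * r[1]) / det, (m00 * r[1] - m10 * r[0]) / det]

print("tail-averaged estimate:", estimate)
print("exact values:          ", v_true)
```

With one-hot features the TD fixed point coincides with the true value function, so the two printed vectors can be compared directly. Setting `k = 0` averages all iterates instead, which lets the initial error linger in the average for longer.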

Abstract

We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From our analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
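
The effect of regularisation on the target can be illustrated numerically. The TL;DR gives the regularised target as $(A+\lambda I)^{-1} b$; writing $\theta^\star = A^{-1} b$, the elementary identity $(A+\lambda I)^{-1} b - \theta^\star = -\lambda (A+\lambda I)^{-1} \theta^\star$ shows that the gap shrinks linearly in $\lambda$. The sketch below checks this for a hypothetical symmetric positive-definite $2 \times 2$ matrix; `A`, `b`, and the grid of `lam` values are assumptions chosen for illustration, and the bound $\lambda \|\theta^\star\| / \mu_{\min}$ used here follows from the identity for symmetric $A$ and is not the paper's stated bound.

```python
import math


def solve_2x2(M, v):
    """Solve M x = v for a 2x2 matrix M using Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


# Hypothetical symmetric positive-definite system (values assumed).
A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 1.0]

theta_star = solve_2x2(A, b)

# Smallest eigenvalue of a symmetric 2x2 matrix in closed form.
half_trace = (A[0][0] + A[1][1]) / 2
mu_min = half_trace - math.sqrt(((A[0][0] - A[1][1]) / 2) ** 2 + A[0][1] ** 2)

rows = []
for lam in (0.08, 0.04, 0.02, 0.01):
    A_reg = [[A[0][0] + lam, A[0][1]], [A[1][0], A[1][1] + lam]]
    theta_reg = solve_2x2(A_reg, b)
    dist = norm([x - y for x, y in zip(theta_reg, theta_star)])
    bound = lam * norm(theta_star) / mu_min
    rows.append((lam, dist, bound))
    print(f"lam={lam:.2f}  distance={dist:.5f}  lam*||theta*||/mu_min={bound:.5f}")
```

Halving `lam` roughly halves the printed distance, in line with the $O(\lambda)$ scaling described in the TL;DR.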

Paper Structure

This paper contains 37 sections, 32 theorems, 142 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1

Suppose asm:stationary asm:phiFullRank hold. Choose a step size $\gamma$ satisfying where $\beta$ is the discount factor and $\Phi_{\mathsf{max}}$ is a bound on the features (see Assumption asm:bddFeatures). Then the expected error of the tail-averaged iterate $\mathbf{\theta}_{k+1,N}$ when using Algorithm alg:ciac-a satisfies where $N = t - k$, $\theta_0$ is the initial point, $\sigma^2 = (R_{\m

Figures (1)

  • Figure 1: A two state Markov chain

Theorems & Definitions (74)

  • Theorem 1: Bound in expectation
  • proof
  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Theorem 2: High-probability bound
  • proof
  • ...and 64 more