Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation
Gandharv Patil, Prashanth L. A., Dheeraj Nagaraj, Doina Precup
TL;DR
This work provides a finite-time analysis of tail-averaged temporal-difference (TD) learning with linear function approximation, achieving the optimal $O\left(1/t\right)$ convergence rate with a universal step-size that does not require knowledge of eigenvalues. It introduces tail-averaged TD and tail-averaged TD with regularisation, proves bounds both in expectation and with high probability, and shows that averaging only the final iterates makes the initial error decay exponentially while keeping the variance at $O\left(1/t\right)$. The regularised variant converges to the regularised fixed point $(A+\lambda I)^{-1} b$ and can offer improved bounds in ill-conditioned settings; its distance to the vanilla TD fixed point scales as $O(\lambda)$. The analysis also extends to Markov sampling via mixing arguments, so the results hold beyond i.i.d. data, and it compares favourably with prior work that required eigenvalue information or projection constraints. Overall, the results provide practically tunable, interpretable finite-time guarantees for TD with linear function approximation under tail-averaging and regularisation.
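The $O(\lambda)$ distance between the two fixed points can be checked with a short calculation. This is a sketch under assumptions not spelled out in the summary: $A$ is taken to be invertible, the vanilla TD fixed point is written $\theta^\star = A^{-1} b$, and $\theta^\star_{\mathrm{reg}}$ is a label introduced here for the regularised fixed point.

$$
\theta^\star - \theta^\star_{\mathrm{reg}}
= A^{-1} b - (A+\lambda I)^{-1} b
= (A+\lambda I)^{-1}\big[(A+\lambda I) - A\big] A^{-1} b
= \lambda\,(A+\lambda I)^{-1}\,\theta^\star .
$$

Taking norms gives $\|\theta^\star - \theta^\star_{\mathrm{reg}}\| \le \lambda\,\|(A+\lambda I)^{-1}\|\,\|\theta^\star\|$, which is linear in $\lambda$ whenever $\|(A+\lambda I)^{-1}\|$ stays bounded.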
Abstract
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From our analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
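To make the idea concrete, the following is a minimal sketch of tail-averaged TD(0) with linear features on a toy Markov reward process. It is an illustration, not the authors' implementation. Everything specific in it is an assumption made here: the four-state chain `P`, rewards `R`, features `PHI`, the discount `GAMMA`, the constant step-size `0.1`, i.i.d. sampling of states from the stationary distribution, and the form of the regularised update (the TD direction minus `lam * theta`). The function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P, num_steps=1000):
    """Approximate the stationary distribution of P by power iteration."""
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(num_steps):
        mu = mu @ P
    return mu / mu.sum()

def td_matrices(phi, P, r, gamma, mu):
    """Return A = Phi^T D (I - gamma P) Phi and b = Phi^T D r."""
    D = np.diag(mu)
    A = phi.T @ D @ (np.eye(len(mu)) - gamma * P) @ phi
    b = phi.T @ D @ r
    return A, b

def td_fixed_point(phi, P, r, gamma, mu, lam=0.0):
    """Solve (A + lam I) theta = b; lam = 0 gives the vanilla TD fixed point."""
    A, b = td_matrices(phi, P, r, gamma, mu)
    return np.linalg.solve(A + lam * np.eye(phi.shape[1]), b)

def tail_averaged_td(phi, P, r, gamma, mu, step_size, num_iters,
                     tail_start, lam=0.0, seed=0):
    """Run TD(0) with a constant step-size and average iterates after tail_start."""
    rng = np.random.default_rng(seed)
    n_states, d = phi.shape
    theta = np.zeros(d)
    tail_sum = np.zeros(d)
    for t in range(num_iters):
        # Assumed sampling model: state drawn i.i.d. from mu, next state from P.
        s = rng.choice(n_states, p=mu)
        s_next = rng.choice(n_states, p=P[s])
        td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        # Assumed regularised update; lam = 0 recovers plain TD(0).
        theta = theta + step_size * (td_error * phi[s] - lam * theta)
        if t >= tail_start:
            tail_sum += theta
    return tail_sum / (num_iters - tail_start)

# Toy problem (all values are illustrative assumptions).
P = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.3, 0.1, 0.5, 0.1],
              [0.2, 0.2, 0.1, 0.5],
              [0.4, 0.1, 0.3, 0.2]])
R = np.array([1.0, 0.0, -1.0, 2.0])
PHI = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6]])
GAMMA = 0.5
MU = stationary_distribution(P)

theta_star = td_fixed_point(PHI, P, R, GAMMA, MU)
theta_bar = tail_averaged_td(PHI, P, R, GAMMA, MU, step_size=0.1,
                             num_iters=20000, tail_start=5000)
print("parameter error of tail average:", np.linalg.norm(theta_bar - theta_star))
```

Discarding the first `tail_start` iterates is what removes most of the dependence on the initial point `theta = 0`, while averaging the remaining iterates smooths out the sampling noise. Passing `lam > 0` makes the average settle near `td_fixed_point(..., lam=lam)` rather than the vanilla fixed point.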
