Statistical Inference for Temporal Difference Learning with Linear Function Approximation
Weichen Wu, Gen Li, Yuting Wei, Alessandro Rinaldo
TL;DR
This work advances statistical inference for Temporal Difference learning with linear function approximation by establishing nonasymptotic, high-probability bounds that scale with the asymptotic variance, under weaker conditions than prior results, and by proving a high-dimensional Berry-Esseen bound with rate $O(T^{-1/3})$. It also introduces an online plug-in estimator for the asymptotic covariance, enabling finite-sample confidence regions for the linear value-function parameters, with provable coverage guarantees. The authors further compare their Berry-Esseen results to recent work and demonstrate the practical viability of the online covariance estimator in providing accurate Gaussian approximations for TD-based policy evaluation in high dimensions. Simulations corroborate the theory across various settings, highlighting the method’s efficiency and accuracy for constructing both individual and simultaneous confidence sets. Overall, the paper provides a principled, scalable framework for uncertainty quantification in TD learning with linear function approximation.
Abstract
We investigate the statistical properties of Temporal Difference (TD) learning with Polyak-Ruppert averaging, arguably one of the most widely used algorithms in reinforcement learning, for the task of estimating the parameters of the optimal linear approximation to the value function. Assuming independent samples, we make three theoretical contributions that improve upon the current state-of-the-art results: (i) we derive sharper high-probability convergence guarantees that depend explicitly on the asymptotic variance and hold under weaker conditions than those adopted in the literature; (ii) we establish refined high-dimensional Berry-Esseen bounds over the class of convex sets, achieving faster rates than the best known results; and (iii) we propose and analyze a novel, computationally efficient online plug-in estimator of the asymptotic covariance matrix. These results enable the construction of confidence regions and simultaneous confidence intervals for the linear parameters of the value function approximation, with guaranteed finite-sample coverage. We demonstrate the applicability of our theoretical findings through numerical experiments.
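To make the setting concrete, the following is a minimal sketch of averaged TD learning with linear function approximation on i.i.d. samples, together with a simple running plug-in estimate of the asymptotic covariance. It is an illustration under stated assumptions, not the authors' implementation: the toy Markov reward process, the helper names `make_mrp` and `averaged_td`, the step-size schedule $\eta_t = \eta_0 t^{-\alpha}$, the whitening of the features, and the choice to evaluate residuals at the current iterate are all hypothetical, and the paper's estimator may differ in its details.

```python
import numpy as np


def make_mrp(n_states=10, dim=3, gamma=0.5, seed=0):
    """Build a small random Markov reward process (hypothetical test bed)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_states))
    P /= P.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
    r = rng.random(n_states)                 # deterministic reward per state
    mu = np.full(n_states, 1.0 / n_states)
    for _ in range(1000):                    # power iteration for the stationary law
        mu = mu @ P
    mu /= mu.sum()
    Phi = rng.standard_normal((n_states, dim))
    # Assumption: whiten features so E_mu[phi phi^T] = I (keeps the toy well conditioned).
    L = np.linalg.cholesky(Phi.T @ (mu[:, None] * Phi))
    Phi = np.linalg.solve(L, Phi.T).T
    # Exact target theta* = A^{-1} b with A = E[phi (phi - gamma phi')^T], b = E[r phi].
    A = Phi.T @ (mu[:, None] * (Phi - gamma * P @ Phi))
    b = Phi.T @ (mu * r)
    return P, r, Phi, mu, np.linalg.solve(A, b)


def averaged_td(P, r, Phi, mu, gamma=0.5, T=20000, eta0=0.2, alpha=2 / 3, seed=1):
    """TD(0) with Polyak-Ruppert averaging and a running plug-in covariance estimate."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    A_bar = np.zeros((d, d))
    Gamma_bar = np.zeros((d, d))
    for t in range(1, T + 1):
        s = rng.choice(n, p=mu)              # i.i.d. state from the stationary law
        s_next = rng.choice(n, p=P[s])       # one transition from that state
        A_t = np.outer(Phi[s], Phi[s] - gamma * Phi[s_next])
        b_t = r[s] * Phi[s]
        g = A_t @ theta - b_t                # TD residual direction at the current iterate
        theta = theta - eta0 * t ** (-alpha) * g
        theta_bar += (theta - theta_bar) / t          # running average of the iterates
        A_bar += (A_t - A_bar) / t                    # running estimate of A
        Gamma_bar += (np.outer(g, g) - Gamma_bar) / t # running estimate of the noise covariance
    A_inv = np.linalg.inv(A_bar)
    Lambda_hat = A_inv @ Gamma_bar @ A_inv.T  # sandwich form A^{-1} Gamma A^{-T}
    return theta_bar, Lambda_hat


gamma, T = 0.5, 20000
P, r, Phi, mu, theta_star = make_mrp(gamma=gamma)
theta_bar, Lambda_hat = averaged_td(P, r, Phi, mu, gamma=gamma, T=T)
# Entrywise 95% intervals based on the normal approximation.
half_width = 1.96 * np.sqrt(np.diag(Lambda_hat) / T)
print("estimation error:", np.linalg.norm(theta_bar - theta_star))
print("interval half-widths:", half_width)
```

The intervals `theta_bar[j] ± half_width[j]` are individual (per-coordinate) intervals based on a Gaussian approximation. Simultaneous intervals or a joint confidence region would need a different critical value, which this sketch does not attempt to reproduce.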
