Gaussian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning

Sergey Samsonov; Eric Moulines; Qi-Man Shao; Zhuo-Song Zhang; Alexey Naumov

TL;DR

This work develops non-asymptotic statistical inference for Polyak-Ruppert averaged linear stochastic approximation (LSA) under i.i.d. noise. It proves a Berry-Esseen bound for the multivariate normal approximation of $\sqrt{n}(\bar{\theta}_{n}-\theta^*)$, with the optimal rate $n^{-1/4}$ (up to log factors) attained by the aggressive step size $\alpha_k \sim k^{-1/2}$, and provides a non-asymptotic multiplier bootstrap that yields valid confidence sets without requiring knowledge of the asymptotic covariance. The results are specialized to temporal-difference learning with linear function approximation, including explicit stability conditions and constants. A numerical study on TD learning in Garnet environments demonstrates the predicted rates and the practical viability of bootstrap-based confidence intervals in online settings. Overall, the paper furnishes a framework for finite-sample normal approximation and bootstrap-based inference for online linear stochastic approximation, with clear implications for RL value-function estimation.
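To make the link between TD learning and LSA concrete, the following is a minimal sketch of the standard TD(0) construction with linear function approximation under i.i.d. sampling. Each observation $(s, s')$ yields a random pair $(A_k, b_k)$, and the TD fixed point solves the mean system $\bar{\mathbf{A}}\theta^\star = \bar{b}$. The function names, the 3-state chain, the rewards, and the tabular feature matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def td_observation(phi_s, phi_next, reward, discount):
    """One LSA observation (A_k, b_k) built from a single TD(0) transition."""
    A_k = np.outer(phi_s, phi_s - discount * phi_next)
    b_k = reward * phi_s
    return A_k, b_k

def td_mean_system(P, r, Phi, mu, discount):
    """Mean system (A_bar, b_bar) when s ~ mu and s' ~ P(. | s)."""
    D = np.diag(mu)
    A_bar = Phi.T @ D @ (np.eye(len(mu)) - discount * P) @ Phi
    b_bar = Phi.T @ D @ r
    return A_bar, b_bar

# Hypothetical 3-state chain; P is doubly stochastic, so mu is uniform.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])
mu = np.full(3, 1.0 / 3.0)
discount = 0.9
Phi = np.eye(3)  # tabular features (assumption), one row per state

A_bar, b_bar = td_mean_system(P, r, Phi, mu, discount)
theta_star = np.linalg.solve(A_bar, b_bar)

# With tabular features the TD fixed point is the true value function.
v_true = np.linalg.solve(np.eye(3) - discount * P, r)

# Averaging the per-transition A_k over (s, s') recovers A_bar exactly.
A_avg = np.zeros((3, 3))
for s in range(3):
    for s_next in range(3):
        A_k, _ = td_observation(Phi[s], Phi[s_next], r[s], discount)
        A_avg += mu[s] * P[s, s_next] * A_k
```

With a lower-dimensional `Phi`, `theta_star` is instead the projected fixed point, and the same `(A_k, b_k)` pairs can be fed to any LSA recursion.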

Abstract

In this paper, we obtain the Berry-Esseen bound for multivariate normal approximation for the Polyak-Ruppert averaged iterates of the linear stochastic approximation (LSA) algorithm with decreasing step size. Moreover, we prove the non-asymptotic validity of the confidence intervals for parameter estimation with LSA based on multiplier bootstrap. This procedure updates the LSA estimate together with a set of randomly perturbed LSA estimates upon the arrival of subsequent observations. We illustrate our findings in the setting of temporal difference learning with linear function approximation.
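The online bootstrap described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the perturbed trajectories reuse each observation $(A_k, b_k)$ with an independent random weight of mean 1 and variance 1 (exponential weights here), a step size $\alpha_k = c_0 / k^{p}$, and a plain running average over all iterates. The toy system `A_BAR`, `B_BAR`, the noise level, and all function and parameter names are hypothetical.

```python
import numpy as np

def lsa_multiplier_bootstrap(sample, d, n, n_boot=100, c0=0.2, step_power=0.5, seed=0):
    """Run averaged LSA together with n_boot randomly perturbed copies."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                # main LSA iterate
    theta_b = np.zeros((n_boot, d))    # perturbed iterates, one per row
    avg = np.zeros(d)                  # Polyak-Ruppert average of theta
    avg_b = np.zeros((n_boot, d))      # averages of the perturbed iterates
    for k in range(1, n + 1):
        A_k, b_k = sample(rng)
        alpha = c0 / k ** step_power
        # Assumed weight law: i.i.d. exponential, mean 1 and variance 1.
        w = rng.exponential(1.0, size=n_boot)
        theta = theta - alpha * (A_k @ theta - b_k)
        theta_b = theta_b - alpha * w[:, None] * (theta_b @ A_k.T - b_k)
        # Running averages, updated online without storing the trajectory.
        avg += (theta - avg) / k
        avg_b += (theta_b - avg_b) / k
    return avg, avg_b

# Hypothetical toy system with -A_BAR Hurwitz.
A_BAR = np.array([[1.0, 0.2], [0.0, 1.5]])
B_BAR = np.array([1.0, 1.0])

def toy_sample(rng):
    A_k = A_BAR + 0.1 * rng.standard_normal((2, 2))
    b_k = B_BAR + 0.1 * rng.standard_normal(2)
    return A_k, b_k

n = 4000
theta_bar, theta_bar_boot = lsa_multiplier_bootstrap(toy_sample, d=2, n=n)
theta_star = np.linalg.solve(A_BAR, B_BAR)

# Bootstrap quantile of sqrt(n) * ||theta_bar_boot - theta_bar|| gives a ball radius.
boot_stat = np.sqrt(n) * np.linalg.norm(theta_bar_boot - theta_bar, axis=1)
radius = np.quantile(boot_stat, 0.95) / np.sqrt(n)
```

The ball of radius `radius` around `theta_bar` plays the role of a 95% confidence set in this sketch. No estimate of the asymptotic covariance is needed, and memory grows only with the number of perturbed copies, not with $n$.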

Paper Structure (29 sections, 30 theorems, 266 equations, 1 figure)

Introduction
Related works
Accuracy of normal approximation for LSA
Central limit theorem for Polyak-Ruppert averaged LSA iterates.
Discussion.
Multiplier bootstrap for LSA
Applications to the TD learning and numerical results
Numerical results.
Conclusion
Proofs for accuracy of normal approximation
Expansion of the error of LSA equipped with the Polyak-Ruppert averaging
Bounding the error of the LSA algorithm last iterate
Proof of \ref{th:theo_1_iid}
Proof of \ref{th:shao2022_berry}
Proof of auxiliary lemmas for \ref{th:shao2022_berry}.
...and 14 more sections

Key Result

Proposition 1

Let $-\bar{\mathbf{A}}$ be a Hurwitz matrix. Then for any $P = P^{\top} \succ 0$, there exists a unique matrix $Q = Q^{\top} \succ 0$ satisfying the Lyapunov equation $\bar{\mathbf{A}}^\top Q + Q \bar{\mathbf{A}} = P$. Moreover, for explicit constants $a > 0$ and $\alpha_{\infty} > 0$ depending on $P$, $Q$, $\bar{\mathbf{A}}$, and the condition number $\kappa_{Q} = \lambda_{\max}(Q)/\lambda_{\min}(Q)$, it holds for any $\alpha \in [0, \alpha_{\infty}]$ that $\alpha a \leq 1/2$ and $\|\mathrm{I} - \alpha \bar{\mathbf{A}}\|_{Q}^{2} \leq 1 - \alpha a$.
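A small numerical check of the Lyapunov equation and the resulting contraction in the $Q$-norm can be sketched as follows. The constants `a` and `alpha_max` below are assumptions obtained from a direct expansion of $\|(\mathrm{I} - \alpha \bar{\mathbf{A}})x\|_Q^2$; they are not claimed to match the constants of the proposition. The example matrix and the helper names are hypothetical.

```python
import numpy as np

def solve_lyapunov(A_bar, P):
    """Solve A_bar^T Q + Q A_bar = P for Q by vectorizing the equation."""
    d = A_bar.shape[0]
    I = np.eye(d)
    K = np.kron(A_bar.T, I) + np.kron(I, A_bar.T)
    Q = np.linalg.solve(K, P.reshape(-1)).reshape(d, d)
    return 0.5 * (Q + Q.T)  # remove round-off asymmetry

def q_operator_norm(B, Q):
    """Operator norm of B induced by the vector norm ||x||_Q = sqrt(x^T Q x)."""
    evals, evecs = np.linalg.eigh(Q)
    Q_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    Q_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return np.linalg.norm(Q_half @ B @ Q_half_inv, 2)

A_bar = np.array([[1.0, 0.2], [0.0, 1.5]])  # -A_bar is Hurwitz
P = np.eye(2)
Q = solve_lyapunov(A_bar, P)
residual = A_bar.T @ Q + Q @ A_bar - P

# Assumed constants (not the paper's): a = lambda_min(P) / (2 ||Q||), and the
# largest step for which alpha * ||A_bar||_Q^2 <= a and alpha * a <= 1/2.
a = np.linalg.eigvalsh(P).min() / (2.0 * np.linalg.norm(Q, 2))
alpha_max = min(a / q_operator_norm(A_bar, Q) ** 2, 1.0 / (2.0 * a))

alphas = np.linspace(0.0, alpha_max, 20)
contraction_ok = all(
    q_operator_norm(np.eye(2) - alpha * A_bar, Q) ** 2 <= 1.0 - alpha * a + 1e-12
    for alpha in alphas
)
```

The bound follows from $x^\top P x \geq (\lambda_{\min}(P)/\|Q\|)\,\|x\|_Q^2$, which gives $\|(\mathrm{I} - \alpha\bar{\mathbf{A}})x\|_Q^2 \leq (1 - 2\alpha a + \alpha^2 \|\bar{\mathbf{A}}\|_Q^2)\,\|x\|_Q^2$ for this choice of `a`.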

Figures (1)

Figure 1: Subfigure (a): rescaled error $\sqrt{n}\|\bar{\theta}_{n} - \theta^\star\|$, averaged over $N$ independent TD trajectories, for different trajectory lengths $n$. Subfigure (b): approximate quantity $\Delta_n$ from \ref{eq:approx_supremum_experiment} for different powers $\gamma$ and $n$. Subfigure (c): $\Delta_n$ rescaled by the factor $n^{1/4}$ predicted by \ref{th:shao2022_berry}.

Theorems & Definitions (53)

Proposition 1
Theorem 1
Theorem 2
Remark 1
Remark 2
Theorem 3
Corollary 1
Remark 3
Proposition 2
Proposition 3
...and 43 more
