Variance Reduction via Resampling and Experience Replay

Jiale Han, Xiaowu Dai, Yuhua Zhu

TL;DR

This paper develops a theoretical framework that models experience replay through resampled $U$- and $V$-statistics and establishes asymptotic variance reduction under conditions such as $\lim_{n\to\infty} \frac{n}{Bk}=0$ and $k=o(n)$. The framework is applied to policy evaluation in MDPs with LSTD, to continuous-time RL via a PDE-based PhiBE approach, and to kernel ridge regression. In all three settings it yields variance reduction, and for kernel methods it also lowers the computational cost from $O(n^3)$ to as low as $O(n^2)$. Empirical results across the three domains corroborate the variance reductions and stability gains, particularly in data-limited scenarios. The work gives a principled explanation for why experience replay is effective and suggests applicability beyond RL, including scalable kernel methods and possible extensions to federated and active learning.

Abstract

Experience replay is a foundational technique in reinforcement learning that enhances learning stability by storing past experiences in a replay buffer and reusing them during training. Despite its practical success, its theoretical properties remain underexplored. In this paper, we present a theoretical framework that models experience replay using resampled $U$- and $V$-statistics, providing rigorous variance reduction guarantees. We apply this framework to policy evaluation tasks using the Least-Squares Temporal Difference (LSTD) algorithm and a Partial Differential Equation (PDE)-based model-free algorithm, demonstrating significant improvements in stability and efficiency, particularly in data-scarce scenarios. Beyond policy evaluation, we extend the framework to kernel ridge regression, showing that the experience replay-based method reduces the computational cost from the traditional $O(n^3)$ in time to as low as $O(n^2)$ in time while simultaneously reducing variance. Extensive numerical experiments validate our theoretical findings, demonstrating the broad applicability and effectiveness of experience replay in diverse machine learning tasks.
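The resampling idea in the abstract can be illustrated with a minimal sketch: draw $B$ subsamples of size $k$ from the $n$ stored observations, apply the base estimator to each, and average the results. Sampling without replacement corresponds to the $U$-statistic variant and sampling with replacement to the $V$-statistic variant. Everything concrete below is an assumption made for illustration and is not taken from the paper: the helper names `replay_estimate` and `least_squares`, the choice of ordinary least squares as the base estimator, the synthetic data, and the values $n=200$, $k=60$, $B=50$.

```python
import numpy as np

def replay_estimate(data, estimator, k, B, replace, rng):
    """Average `estimator` over B random subsamples of size k.

    replace=False mimics the U-statistic variant (without replacement),
    replace=True mimics the V-statistic variant (with replacement).
    """
    n = len(data)
    estimates = []
    for _ in range(B):
        idx = rng.choice(n, size=k, replace=replace)
        estimates.append(estimator(data[idx]))
    return np.mean(estimates, axis=0)

def least_squares(z):
    """Illustrative base estimator: solves (X^T X) theta = X^T y.

    The last column of z is the response y; the other columns are X.
    """
    x, y = z[:, :-1], z[:, -1]
    return np.linalg.solve(x.T @ x, x.T @ y)

# Synthetic data (an assumption for this sketch, not the paper's setup).
rng = np.random.default_rng(0)
n, k, B = 200, 60, 50
true_theta = np.array([1.0, -2.0])
x = rng.normal(size=(n, 2))
y = x @ true_theta + rng.normal(size=n)
data = np.column_stack([x, y])

theta_plain = least_squares(data)  # no replay: one fit on all n points
theta_u = replay_estimate(data, least_squares, k, B, replace=False, rng=rng)
theta_v = replay_estimate(data, least_squares, k, B, replace=True, rng=rng)

print("plain:", theta_plain)
print("U-type replay:", theta_u)
print("V-type replay:", theta_v)
```

A single run such as this only shows the mechanics. Comparing variances would require repeating the whole procedure over many independently generated datasets, which the sketch does not do.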

Paper Structure

This paper contains 41 sections, 6 theorems, 67 equations, 14 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Let $Z_1,Z_2,\ldots,Z_n\stackrel{iid}{\sim}F_Z$, and let $\tilde{\theta}_n$ be as defined in new_9. Then $\sqrt{n}\left[ \tilde{\theta}_n-\theta\right]\xrightarrow{d} N(0,\Sigma),$ where $\Sigma$ is a constant matrix given by with $G=\left([\mathbb{E}[g(Z)]]^{-1},-\theta^\top\otimes[\mathbb{E}[g(Z)]]^{-1} \right),$ where $\otimes$ denotes the Kronecker product, and $\text{vec}(A)$ reshapes a mat

Figures (14)

  • Figure 1: Variance reduction achieved by experience replay in policy evaluation using two approaches. The $U$- and $V$-statistic methods incorporate experience replay without and with replacement, respectively, into the original method. The solid lines represent the mean estimates, and the shaded areas denote the 95% confidence intervals (CIs), calculated from 50 data replications.
  • Figure 2: Variance differences among the predicted policy values using the LSTD algorithm with $m = 50$, $M = 50$, and $k/n =0.3$, evaluated across various values of $n$ and $B$. $\tilde{V}(s^*_j)$ represents the results without experience replay, while $\hat{V}_{U}(s^*_j)$ and $\hat{V}_{V}(s^*_j)$ represent the results with experience replay. The red line represents the baseline where the variance difference is 0.
  • Figure 3: Variance differences among the predicted policy values using the second-order PDE-based algorithm with $m = 50$, $M = 50$, and $k/n=0.3$, evaluated across various values of $n$ and $B$. $\tilde{V}(s^*_j)$ represents the results without experience replay, while $\hat{V}_{U}(s^*_j)$ and $\hat{V}_{V}(s^*_j)$ represent the results with experience replay. The red line represents the baseline where the variance difference is 0.
  • Figure 4: Variance differences in predicted outcomes using kernel ridge regression on the simulated data with $M = 100$, $m = 100$, and $B = 50$, evaluated across various values of $n$ and $k$. $\tilde{y}$ represents the results without experience replay, while $\hat{y}_{U}$ and $\hat{y}_{V}$ represent the results with experience replay. The red line represents the baseline where the variance difference is $0$.
  • Figure 5: Variance differences among the predicted policy values using the LSTD algorithm with $m = 50$ and $M = 50$, evaluated across various values of $n$, $B$, and $k/n$. $\tilde{V}(s^*_j)$ represents the results without experience replay, while $\hat{V}_{U}(s^*_j)$ and $\hat{V}_{V}(s^*_j)$ represent the results with experience replay. The red line represents the baseline where the variance difference is 0.
  • ...and 9 more figures

Theorems & Definitions (11)

  • Lemma 1
  • Theorem 1: Variance Reduction for $U$-Statistics
  • Theorem 2: Variance Reduction for $V$-Statistics
  • Proof F.1
  • Proof F.2
  • Corollary 3
  • Lemma 2
  • Proof F.3
  • Lemma 3
  • Proof F.4
  • ...and 1 more