On the Convergence of Experience Replay in Policy Optimization: Characterizing Bias, Variance, and Finite-Time Convergence

Hua Zheng, Wei Xie, M. Ben Feng

TL;DR

The paper addresses theoretical gaps in the understanding of experience replay for policy gradient methods by introducing an auxiliary Markov-chain framework with lag-based decoupling that disentangles Markov noise from policy drift. It derives finite-time bias bounds for LR/CLR estimators that depend on cumulative policy updates, mixing rates, and replay age, and provides a correlation-aware variance decomposition showing when replay reduces variance. Building on these characterizations, it establishes finite-time convergence guarantees for experience-replay-based policy optimization, revealing a fundamental bias-variance trade-off: larger buffers reduce variance but increase the bias caused by stale data. The results yield principled guidance for buffer sizing, replay schedules, and lag choices, connecting empirical heuristics with quantitative theory and highlighting the central role of environment mixing in the effectiveness of experience replay.

Abstract

Experience replay is a core ingredient of modern deep reinforcement learning, yet its benefits in policy optimization are poorly understood beyond empirical heuristics. This paper develops a novel theoretical framework for experience replay in modern policy gradient methods, where two sources of dependence fundamentally complicate analysis: Markovian correlations along trajectories and policy drift across optimization iterations. We introduce a new proof technique based on auxiliary Markov chains and lag-based decoupling that makes these dependencies tractable. Within this framework, we derive finite-time bias bounds for policy-gradient estimators under replay, identifying how bias scales with the cumulative policy update, the mixing time of the underlying dynamics, and the age of buffered data, thereby formalizing the practitioner's rule of avoiding overly stale replay. We further provide a correlation-aware variance decomposition showing how sample dependence governs gradient variance from replay and when replay is beneficial. Building on these characterizations, we establish the finite-time convergence guarantees for experience-replay-based policy optimization, explicitly quantifying how buffer size, sample correlation, and mixing jointly determine the convergence rate and revealing an inherent bias-variance trade-off: larger buffers can reduce variance by averaging less correlated samples but can increase bias as data become stale. These results offer a principled guide for buffer sizing and replay schedules, bridging prior empirical findings with quantitative theory.
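The variance side of the trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's Proposition 1: it assumes, purely for illustration, that the buffered gradient samples have equal variance `sigma2` and a geometrically decaying correlation `rho**|i-j|` between samples `i` and `j`. The function name and parameters are hypothetical.

```python
import numpy as np


def variance_of_buffer_mean(buffer_size, sigma2=1.0, rho=0.0):
    """Variance of the average of `buffer_size` correlated scalar samples.

    Assumption (illustrative only): Cov(g_i, g_j) = sigma2 * rho**|i - j|.
    """
    idx = np.arange(buffer_size)
    # Full covariance matrix of the buffered samples under the assumed model.
    cov = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
    # Var(mean) = (1 / B^2) * sum of all covariance entries.
    return cov.sum() / buffer_size**2


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, [round(variance_of_buffer_mean(b, 1.0, rho), 4) for b in (1, 10, 100)])
```

With `rho = 0` the variance falls as `sigma2 / buffer_size`. As `rho` approaches 1, averaging more samples helps less and less, which is the sense in which sample correlation limits the variance reduction available from a larger buffer. The staleness bias that grows with buffer age is not modeled here.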

Paper Structure

This paper contains 26 sections, 34 theorems, 177 equations, 1 figure, 1 algorithm.

Key Result

Lemma 1

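Under Assumption 2, the policy gradient $\nabla J(\pmb\theta)$ is Lipschitz continuous, i.e., there exists a constant $L > 0$ such that, for any $\pmb\theta_1,\pmb\theta_2 \in\Theta$, $\Vert \nabla J(\pmb\theta_1) - \nabla J(\pmb\theta_2) \Vert \le L \Vert \pmb\theta_1 - \pmb\theta_2 \Vert$.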

Figures (1)

  • Figure 1: Illustration of our proof technique. To replay $(\pmb{s}_i,\pmb{a}_i)$ at iteration $k$, we condition on a $t$-lagged pair $(\pmb{s}_{i-t},\pmb\theta_{i-t})$ ($0\le t<i$) and compare three processes: (top) the original trajectory under drifting policies, (middle) an auxiliary MDP that replaces the policy sequence by the fixed policy $\pi_{\pmb\theta_{i-t}}$ to bound policy interdependence, and (bottom) a stationary chain, sampled from the fixed stationary distribution $d^{\pi_{\pmb\theta_{i-t}}}$, to bound the Markovian noise. The bias/convergence bounds follow by decomposing the error quantity into two total-variation distances, namely $\Vert \mathbb{P}(\text{Original MDP}|\pmb{s}_i,\pmb\theta_i) - \mathbb{P}(\text{Auxiliary MDP}|\pmb{s}_i,\pmb\theta_i) \Vert_{TV}$ and $\Vert \mathbb{P}(\text{Auxiliary MDP}|\pmb{s}_i,\pmb\theta_i) - \mathbb{P}(\text{Stationary MDP}|\pmb{s}_i,\pmb\theta_i) \Vert_{TV}$.
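
The two-term decomposition in the caption is an instance of the triangle inequality for total-variation distance. As a sketch, write $P_{\mathrm{orig}}$, $P_{\mathrm{aux}}$, and $P_{\mathrm{stat}}$ for the three conditional laws in the figure. This shorthand is introduced here for illustration and is not the paper's notation:

```latex
\begin{equation*}
\Vert P_{\mathrm{orig}} - P_{\mathrm{stat}} \Vert_{TV}
\;\le\;
\underbrace{\Vert P_{\mathrm{orig}} - P_{\mathrm{aux}} \Vert_{TV}}_{\text{policy drift}}
\;+\;
\underbrace{\Vert P_{\mathrm{aux}} - P_{\mathrm{stat}} \Vert_{TV}}_{\text{Markovian noise}}
\end{equation*}
```

The first term compares the drifting-policy trajectory with the fixed-policy auxiliary chain, and the second compares that auxiliary chain with its stationary distribution, matching the (top)/(middle) and (middle)/(bottom) comparisons in the figure.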

Theorems & Definitions (67)

  • Lemma 1: zhang2020global, Lemma 3.2
  • Lemma 2: Boundedness of Stochastic Policy Gradients
  • Definition 1: fedus2020revisiting
  • Definition 2: Mixing Time, wu2020finite
  • Theorem 3: Gradient Biasedness
  • Corollary 1
  • Proposition 1: Gradient Variance
  • Theorem 4: Convergence
  • Remark 1: L2 importance-weight norm
  • Corollary 2
  • ...and 57 more