Finite-Time Analysis of Temporal Difference Learning with Experience Replay

Han-Dong Lim; Donghwan Lee

TL;DR

The paper analyzes finite-time behavior of tabular on-policy TD-learning with experience replay under Markovian observation models. It introduces a noise-decomposition framework that links the convergence error to replay buffer size $N$ and mini-batch size $L$, enabling effective control of constant-step-size bias without imposing restrictive step-size conditions. The authors provide explicit finite-time bounds for both the averaged and final iterates, showing how increasing $L$ and $N$ reduces error terms and clarifying the roles of mixing time and Markovian correlations. This work sheds light on the theoretical benefits of experience replay in TD learning and offers practical guidance for buffer and batch sizing to improve convergence in RL algorithms.

Abstract

Temporal-difference (TD) learning is one of the most popular algorithms in reinforcement learning (RL). Despite its widespread use, researchers have only recently begun to actively study its finite-time behavior, including finite-time bounds on the mean squared error and the sample complexity. On the empirical side, experience replay has been a key ingredient in the success of deep RL algorithms, but its theoretical effects on RL are not yet fully understood. In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that for both the averaged-iterate and final-iterate cases, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and the size of the mini-batch sampled from it.
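To make the setting concrete, the following is a minimal sketch of tabular on-policy TD(0) with a replay buffer of size $N$ and mini-batches of size $L$, with samples drawn along a single Markovian trajectory. It is an illustration under stated assumptions, not the paper's algorithm verbatim: the function name `td_experience_replay`, the state-only reward vector `R`, the first-in-first-out buffer that fills as the trajectory proceeds, and uniform sampling with replacement are all choices made here for simplicity.

```python
import random
from collections import deque


def td_experience_replay(P, R, gamma, alpha, N, L, steps, seed=0):
    """Tabular TD(0) with a FIFO replay buffer (size N) and mini-batches (size L).

    P: row-stochastic transition matrix of the policy-induced chain (assumed known
       here only so that transitions can be simulated).
    R: reward received when leaving each state (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(P)
    V = [0.0] * n
    buffer = deque(maxlen=N)  # the oldest transition is dropped once N is reached
    s = 0
    for _ in range(steps):
        # Markovian observation model: the next state depends on the current one.
        s_next = rng.choices(range(n), weights=P[s])[0]
        buffer.append((s, R[s], s_next))
        s = s_next
        # Draw L transitions uniformly (with replacement) from the buffer.
        batch = [buffer[rng.randrange(len(buffer))] for _ in range(L)]
        # Evaluate every TD error at the current V, then apply the averaged update.
        deltas = [(i, r + gamma * V[j] - V[i]) for i, r, j in batch]
        for i, delta in deltas:
            V[i] += alpha * delta / L
    return V
```

With a constant step-size `alpha`, the iterate does not converge exactly but hovers around the true value function; the abstract's claim is that the size of this residual error is governed by the buffer size and the mini-batch size, so rerunning the sketch with different `N` and `L` is one informal way to observe the effect.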

Paper Structure (28 sections, 17 theorems, 68 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Related works
Preliminaries
Markov chain
Markov decision process
Temporal difference learning
Main results
TD-learning with experience replay
Analysis framework
Bounds on noise
Averaged iterate convergence
Comparative analysis
Final iterate convergence
Comparative analysis
Conclusion
...and 13 more sections

Key Result

Lemma 2.3

If $\{S_k\}_{k\geq 0}$ is an irreducible and aperiodic Markov chain, then so is $\{(S_k,S_{k+1})\}_{k\geq 0}$.
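The lemma can be checked numerically on small examples. The sketch below assumes a finite state space and restricts the pair chain to pairs $(s, s')$ with $P(s, s') > 0$, since pairs with zero transition probability are never visited. It relies on two standard facts: a finite chain is irreducible and aperiodic exactly when its transition matrix is primitive, and by Wielandt's bound an $n \times n$ primitive matrix has a strictly positive $((n-1)^2+1)$-th power. The helper names `is_primitive` and `pair_chain` are hypothetical.

```python
def is_primitive(P):
    """Return True iff the finite chain with transition matrix P is
    irreducible and aperiodic, using Wielandt's bound on the exponent."""
    n = len(P)
    A = [[P[i][j] > 0 for j in range(n)] for i in range(n)]  # support of P
    M = [[i == j for j in range(n)] for i in range(n)]       # boolean identity
    for _ in range((n - 1) ** 2 + 1):
        # Boolean matrix product M <- M * A (reachability in one more step).
        M = [[any(M[i][k] and A[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return all(all(row) for row in M)


def pair_chain(P):
    """Build the chain (S_k, S_{k+1}) on the pairs that have positive probability."""
    n = len(P)
    pairs = [(i, j) for i in range(n) for j in range(n) if P[i][j] > 0]
    # From (i, j) the chain moves to (j, k) with probability P[j][k].
    Q = [[P[j][k] if j2 == j else 0.0 for (j2, k) in pairs] for (_, j) in pairs]
    return pairs, Q
```

For instance, applying `is_primitive` to a three-state chain with a self-loop and to the matrix returned by `pair_chain` for the same chain should give the same answer, in line with the lemma.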

Figures (1)

Figure 1: Diagram of TD-learning using experience replay

Theorems & Definitions (31)

Definition 2.1: Total variation distance [levin2017markov]
Lemma 2.3
Lemma 2.4
Lemma 3.1
Lemma 3.2: Properties of matrix $A$ [lee2022analysis]
Lemma 3.3
Lemma 3.4: Second moment of $\left\lVert w(M_k^{\pi},V_k)\right\rVert_2$
Theorem 3.5: Convergence rate on average iterate of TD-learning
Theorem 3.6
Lemma A.1
...and 21 more
