Experience Replay with Random Reshuffling

Yasuhiro Fujita

TL;DR

This paper tackles inefficiencies in reinforcement learning experience replay caused by sampling with replacement. It adapts random reshuffling (RR) from supervised learning to RL via two methods: RR-C for uniform experience replay and RR-M for prioritized experience replay, supported by theory and simulations. The approaches reduce variance in how often transitions are sampled and yield modest performance gains on Atari benchmarks across several algorithms, while remaining simple to implement. The work provides practical drop-in replacements for standard sampling and broadens the applicability of RR in RL contexts.

Abstract

Experience replay is a key component in reinforcement learning for stabilizing learning and improving sample efficiency. Its typical implementation samples transitions with replacement from a replay buffer. In contrast, in supervised learning with a fixed dataset, it is a common practice to shuffle the dataset every epoch and consume data sequentially, which is called random reshuffling (RR). RR enjoys theoretically better convergence properties and has been shown to outperform with-replacement sampling empirically. To leverage the benefits of RR in reinforcement learning, we propose sampling methods that extend RR to experience replay, both in uniform and prioritized settings, and analyze their properties via theoretical analysis and simulations. We evaluate our sampling methods on Atari benchmarks, demonstrating their effectiveness in deep reinforcement learning. Code is available at https://github.com/pfnet-research/errr.
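The contrast the abstract draws between with-replacement sampling and random reshuffling can be illustrated on a fixed dataset. The following is a minimal sketch, not code from the paper or its repository; the function names, dataset size, and batch size are assumptions chosen for illustration. It counts how often each item is drawn under each scheme.

```python
import random
from collections import Counter


def sample_with_replacement(n_items, batch_size, n_batches, rng):
    """Draw every minibatch index independently and uniformly."""
    return [[rng.randrange(n_items) for _ in range(batch_size)]
            for _ in range(n_batches)]


def sample_random_reshuffling(n_items, batch_size, n_batches, rng):
    """Shuffle all indices once per epoch, then consume them sequentially."""
    batches, order, pos = [], [], 0
    for _ in range(n_batches):
        batch = []
        while len(batch) < batch_size:
            if pos == len(order):  # epoch exhausted (or first call): reshuffle
                order = list(range(n_items))
                rng.shuffle(order)
                pos = 0
            batch.append(order[pos])
            pos += 1
        batches.append(batch)
    return batches


def count_spread(batches, n_items):
    """Return (min, max) of the per-item sample counts."""
    counts = Counter(i for batch in batches for i in batch)
    per_item = [counts.get(i, 0) for i in range(n_items)]
    return min(per_item), max(per_item)


rng = random.Random(0)
n_items, batch_size, n_batches = 100, 10, 50  # 500 draws = 5 full epochs
wr_spread = count_spread(
    sample_with_replacement(n_items, batch_size, n_batches, rng), n_items)
rr_spread = count_spread(
    sample_random_reshuffling(n_items, batch_size, n_batches, rng), n_items)
print("with replacement (min, max):", wr_spread)
print("random reshuffling (min, max):", rr_spread)  # (5, 5)
```

With 500 draws over 100 items, random reshuffling visits every item exactly 5 times, whereas with-replacement sampling leaves some items under-sampled and others over-sampled. A replay buffer complicates this picture because its contents change over time, which is the setting the proposed methods address.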

Paper Structure

This paper contains 32 sections, 8 theorems, 1 equation, 11 figures, and 9 tables.

Key Result

theorem 1

There exists a configuration such that $\mathbb{E} \left[ X_{i,k}^{\text{RR-C}} \right] \ne \mathbb{E} \left[ X_{i,k}^{\text{UER}} \right]$ for some $(i, k)$.

Figures (11)

  • Figure 1: Python code examples of sampling methods for uniform experience replay. This simplified code is for illustrative purposes and is not identical to actual implementations.
  • Figure 2: A Python code example of our sampling method for prioritized experience replay, RR-M. This simplified code is for illustrative purposes and is not identical to actual implementations.
  • Figure 3: Distributions of sample counts in experience replay simulations. We ran 100-timestep simulations with different random seeds for each configuration. For each transition, we visualize how many times it is sampled during a simulation. Solid lines represent mean sample counts, thick shaded areas represent mean $\pm$ standard deviation, and light shaded areas represent the minimum-to-maximum range over 1000 simulations. WR: with-replacement sampling, WOR: without-replacement sampling, RR-C: RR with a circular buffer, RR-M: RR by masking.
  • Figure 4: Distributions of sample counts in experience replay simulations with a different set of parameters from Figure 3: the minibatch size is 8. Format follows Figure 3.
  • Figure 5: Distributions of sample counts in experience replay simulations with a different set of parameters from Figure 3: the total number of timesteps is 1000, the buffer capacity is 200, the buffer size at which replay starts is 100, and $p_t = (t \bmod 250) + 50$. Format follows Figure 3.
  • ...and 6 more figures
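
The code from Figures 1 and 2 is not reproduced on this page. As a rough illustration of what "RR with a circular buffer" could look like, the sketch below assumes that reshuffling is applied to the slot indices of a circular buffer and that slots not yet filled are skipped. The class name, its methods, and this skipping rule are all assumptions made for illustration; this is not the code from Figure 1 or from the authors' repository.

```python
import random


class ReshuffledCircularBuffer:
    """Hypothetical sketch: random reshuffling over circular-buffer slots."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = [None] * capacity
        self.size = 0        # number of filled slots
        self.next_slot = 0   # slot that the next transition overwrites
        self.rng = random.Random(seed)
        self.order = []      # shuffled slot indices for the current epoch
        self.pos = 0         # read position within self.order

    def add(self, transition):
        """Store a transition, overwriting the oldest one when full."""
        self.data[self.next_slot] = transition
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        """Consume shuffled slot indices sequentially, reshuffling per epoch."""
        if self.size == 0:
            raise ValueError("cannot sample from an empty buffer")
        batch = []
        while len(batch) < batch_size:
            if self.pos == len(self.order):  # epoch exhausted: reshuffle
                self.order = list(range(self.capacity))
                self.rng.shuffle(self.order)
                self.pos = 0
            slot = self.order[self.pos]
            self.pos += 1
            if slot < self.size:  # assumed rule: skip slots not yet filled
                batch.append(self.data[slot])
        return batch


buf = ReshuffledCircularBuffer(capacity=8)
for t in range(8):
    buf.add(t)
first, second = buf.sample(4), buf.sample(4)
print(sorted(first + second))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because a slot keeps its position in the shuffled order even after its contents are overwritten, a newly stored transition inherits whatever turn its slot had in the current epoch. Under the assumptions above, this is one way the expected sample counts could come to differ from those of uniform with-replacement sampling, the kind of difference that theorem 1 states for RR-C.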

Theorems & Definitions (15)

  • theorem 1
  • proof
  • theorem 2
  • proof
  • theorem 3
  • proof
  • theorem 4
  • proof
  • Lemma 5
  • proof
  • ...and 5 more