CUER: Corrected Uniform Experience Replay for Off-Policy Continuous Deep Reinforcement Learning Algorithms

Arda Sarp Yenicesu; Furkan B. Mutlu; Suleyman S. Kozat; Ozgur S. Oguz

TL;DR

This work tackles inefficiency and instability in off-policy deep RL caused by naive replay sampling, which biases learning toward older transitions. It introduces Corrected Uniform Experience Replay (CUER), a sampling scheme that dynamically adjusts transition priorities to preserve fairness across the full history while nudging the sampled distribution toward on-policy behavior. Transitions are sampled with probability $\Pr(t_i) = P(t_i)/\sum_j P(t_j)$, with priorities initialized as $P(t_i) = \frac{\text{batch\_size}}{\Psi}$. The approach is implemented efficiently via sum-tree structures and validated on MuJoCo continuous-control tasks with TD3 and SAC, showing faster convergence, lower variance, and stronger final performance than standard uniform sampling and traditional PER/CER baselines; combining CUER with CER yields further gains. The results demonstrate CUER’s robustness to buffer size and its compatibility with existing replay enhancements, making it a practical improvement for off-policy reinforcement learning systems.
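The sampling rule quoted above can be illustrated with a minimal Python sketch. It only shows how priorities are normalized into sampling probabilities and how an initial priority would be computed. It is not the authors' implementation: the function names are hypothetical, `psi` is a placeholder for $\Psi$ (whose definition is not given in this summary), the dynamic priority adjustment is not shown, and direct normalization is used in place of the sum-tree mentioned above for brevity.

```python
import numpy as np


def sampling_probabilities(priorities):
    """Normalize non-negative priorities P(t_i) into probabilities Pr(t_i)."""
    p = np.asarray(priorities, dtype=np.float64)
    if p.ndim != 1 or p.size == 0:
        raise ValueError("priorities must be a non-empty 1-D sequence")
    if np.any(p < 0):
        raise ValueError("priorities must be non-negative")
    total = p.sum()
    if total <= 0:
        raise ValueError("at least one priority must be positive")
    # Pr(t_i) = P(t_i) / sum_j P(t_j)
    return p / total


def initial_priority(batch_size, psi):
    """Initial priority batch_size / Psi; `psi` is an assumed placeholder."""
    return batch_size / psi


def sample_indices(priorities, batch_size, rng):
    """Draw transition indices with probability proportional to priority."""
    probs = sampling_probabilities(priorities)
    # Sampling with replacement; this costs O(n) per call, unlike a sum-tree.
    return rng.choice(probs.size, size=batch_size, replace=True, p=probs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    priorities = [1.0, 1.0, 2.0]
    print(sampling_probabilities(priorities))
    print(sample_indices(priorities, batch_size=5, rng=rng))
```

A sum-tree would replace the linear normalization with logarithmic-time sampling and priority updates, which matters once the buffer holds many transitions.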

Abstract

The experience replay mechanism enables agents to reuse their experiences multiple times. In previous studies, the sampling probability of each transition was modified according to its relative importance. Reassigning sampling probabilities for every transition in the replay buffer after each iteration is extremely inefficient. Hence, to improve computational efficiency, experience replay prioritization algorithms reassess the importance of a transition only when it is sampled. However, the relative importance of transitions changes dynamically as the agent's policy and value function are iteratively updated. Furthermore, experience replay retains transitions generated by the agent's past policies, which may diverge significantly from the agent's most recent policy. A larger deviation from the most recent policy leads to more off-policy updates, which negatively affects the agent's performance. In this paper, we develop a novel algorithm, Corrected Uniform Experience Replay (CUER), which samples stored experiences stochastically while maintaining fairness among all experiences and accounting for the dynamic nature of transition importance, by making the sampled state distribution more on-policy. CUER provides promising improvements for off-policy continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during training.

Paper Structure (16 sections, 5 equations, 3 figures)

Introduction
Related Work
Background
Reinforcement Learning
Twin Delayed Deep Deterministic Policy Gradient (TD3)
Soft Actor-Critic (SAC)
Experience Replay Methods
Corrected Uniform Experience Replay (CUER)
Motivation
Proposed Strategy: Dynamic Transition Priority Adjustment
Experiments
Task Selection
Benchmark Results
Comparison with CER
Investigation of Different Buffer Sizes
...and 1 more sections

Figures (3)

Figure 1: Comparison of CUER with SOTA baselines in various environments.
Figure 2: Comparison of TD3_CER and TD3_CER_CUER in various environments.
Figure 3: Comparison of CUER with Uniform Sampling having different buffer sizes in various environments.
