Enhancing Deep Deterministic Policy Gradients on Continuous Control Tasks with Decoupled Prioritized Experience Replay

Mehmet Efe Lorasdagi, Dogan Can Cicek, Furkan Burak Mutlu, Suleyman Serdar Kozat

TL;DR

This work tackles the mismatch between Actor and Critic learning signals in off-policy RL by introducing Decoupled Prioritized Experience Replay (DPER), which samples separate batches for the Actor and the Critic. The Actor's batch is chosen to resemble on-policy data through a KL-divergence-driven selection over K candidate batches, while the Critic continues to use prioritized sampling. Integrated with TD3, DPER demonstrates improved performance on six MuJoCo continuous-control tasks, with small values of K (2–3) yielding the best trade-off between efficiency and gains. The results suggest that decoupling replay signals for the Actor and Critic can enhance stability and final policy quality across a broad class of off-policy methods.

Abstract

Background: Deep Deterministic Policy Gradient-based reinforcement learning algorithms use Actor-Critic architectures in which both networks are typically trained on identical batches of replayed transitions. However, the learning objectives and update dynamics of the Actor and Critic differ, which raises the question of whether training both on the same transitions is optimal. Objectives: We aim to improve the performance of deep deterministic policy gradient algorithms by decoupling the transition batches used to train the Actor and the Critic. Our goal is to design an experience replay mechanism that provides appropriate learning signals to each component by using separate, tailored batches. Methods: We introduce Decoupled Prioritized Experience Replay (DPER), a novel approach that allows independent sampling of transition batches for the Actor and the Critic. DPER can be integrated into any off-policy deep reinforcement learning algorithm that operates in continuous control domains. We combine DPER with the state-of-the-art Twin Delayed DDPG (TD3) algorithm and evaluate its performance across standard continuous control benchmarks. Results: DPER outperforms conventional experience replay strategies such as vanilla experience replay and prioritized experience replay in multiple MuJoCo tasks from the OpenAI Gym suite. Conclusions: Our findings show that decoupling experience replay for Actor and Critic networks can enhance training dynamics and final policy quality. DPER offers a generalizable mechanism that enhances performance for a wide class of actor-critic off-policy reinforcement learning algorithms.
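
The decoupled sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: transitions are reduced to `(state, action)` pairs, the Critic batch is drawn proportionally to given priorities, the Actor candidates are drawn uniformly, and "resembling on-policy data" is measured by a KL divergence between diagonal Gaussians fitted to the stored actions and to the actions the current policy would take in the same states. The function names and the `policy` callable are hypothetical.

```python
import math
import random


def diag_gaussian_kl(p_samples, q_samples, eps=1e-6):
    """KL(P || Q) between diagonal Gaussians fitted to two sets of action vectors."""
    kl = 0.0
    for d in range(len(p_samples[0])):
        p = [a[d] for a in p_samples]
        q = [a[d] for a in q_samples]
        mu_p, mu_q = sum(p) / len(p), sum(q) / len(q)
        # eps keeps the variances strictly positive (assumed regularizer).
        var_p = sum((x - mu_p) ** 2 for x in p) / len(p) + eps
        var_q = sum((x - mu_q) ** 2 for x in q) / len(q) + eps
        kl += 0.5 * (math.log(var_q / var_p)
                     + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return kl


def sample_critic_batch(buffer, priorities, batch_size, rng):
    """Critic batch: sample with replacement, proportionally to priority."""
    idx = rng.choices(range(len(buffer)), weights=priorities, k=batch_size)
    return [buffer[i] for i in idx]


def select_actor_batch(buffer, policy, batch_size, K, rng):
    """Actor batch: draw K candidate batches and keep the one whose stored
    actions are closest (smallest KL) to the current policy's actions."""
    best, best_kl = None, math.inf
    for _ in range(K):
        cand = rng.sample(buffer, batch_size)
        stored = [action for (_, action) in cand]
        current = [policy(state) for (state, _) in cand]
        kl = diag_gaussian_kl(stored, current)
        if kl < best_kl:
            best, best_kl = cand, kl
    return best, best_kl
```

In this sketch, increasing K costs K policy evaluations per Actor update, which is one plausible reading of why small K values are reported as the best trade-off between efficiency and gains.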

Paper Structure

This paper contains 22 sections, 24 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Average cumulative rewards across tasks; shaded regions indicate half a standard deviation.
  • Figure 2: K-sweep ablation: average cumulative rewards across tasks; shaded regions indicate half a standard deviation.
  • Figure 3: Uniform-DPER: average cumulative rewards across tasks; shaded regions indicate half a standard deviation.