ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control

Ehsan Futuhi; Shayan Karimi; Chao Gao; Martin Müller

TL;DR

ETGL-DDPG tackles sparse-reward reinforcement learning in continuous control by integrating three complementary techniques into DDPG: $\epsilon t$-greedy exploration with tree search guided by state visitation counts, a Goal-conditioned Dual Replay Buffer (GDRB) that separates all experiences from successful ones with adaptive sampling, and longest $n$-step returns to propagate rewards quickly along successful trajectories. The authors prove polynomial (PAC-MDP) sample complexity for the $\epsilon t$-greedy component under mild assumptions and demonstrate superior empirical performance across a suite of sparse-reward navigation and manipulation tasks, with ablations confirming the contribution of each component. They further analyze environment coverage and reward propagation, showing that directed exploration and dual-buffer learning yield more data-efficient training in sparse settings. Limitations include hashing-based visit counts and a focus on deterministic domains, with future work on dynamic hashing and extending the approach to stochastic environments.
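The dual-buffer idea can be illustrated with a minimal sketch. It assumes two FIFO buffers, one holding every episode and one holding only successful episodes (called $D_{\beta}$ and $D_{e}$ in Figure 1). The class name `DualReplayBuffer` and the `success_fraction` parameter are hypothetical: the summary does not specify the adaptive sampling schedule or how goal conditioning is applied, so the caller supplies the mixing ratio here and goal relabeling is left out.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Two FIFO buffers: one for all episodes, one for successful ones only."""

    def __init__(self, capacity, seed=0):
        self.all_transitions = deque(maxlen=capacity)      # plays the role of D_beta
        self.success_transitions = deque(maxlen=capacity)  # plays the role of D_e
        self.rng = random.Random(seed)

    def add_episode(self, transitions, success):
        # Every episode goes into the first buffer.
        self.all_transitions.extend(transitions)
        # Only episodes that reached the goal go into the second buffer.
        if success:
            self.success_transitions.extend(transitions)

    def _draw(self, buffer, count):
        # Uniform sampling with replacement from one buffer.
        return [buffer[self.rng.randrange(len(buffer))] for _ in range(count)]

    def sample(self, batch_size, success_fraction):
        # success_fraction is a hypothetical knob (assumption): the share of
        # the batch drawn from the success buffer. It falls back to zero
        # while no successful episode has been stored yet.
        n_success = 0
        if self.success_transitions:
            n_success = int(round(batch_size * success_fraction))
        batch = self._draw(self.success_transitions, n_success)
        batch += self._draw(self.all_transitions, batch_size - n_success)
        return batch
```

Sampling from both buffers keeps rare rewarded transitions in most batches even when the bulk of collected episodes fail, which is the motivation the summary gives for separating successful experience.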

Abstract

We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, $\epsilon t$-greedy, which generates exploratory options for exploring less-visited states. We prove that search using $\epsilon t$-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, GDRB, and implement longest $n$-step returns. The resulting algorithm, ETGL-DDPG, integrates all three techniques ($\epsilon t$-greedy, GDRB, and Longest $n$-step) into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.
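As a rough illustration of the longest $n$-step return, the sketch below computes one target per time step by accumulating discounted rewards all the way to the end of the episode. The Figure 1 caption states that in successful episodes the target is the episode return; the treatment of unsuccessful episodes (bootstrapping from a critic estimate at the final state) is an assumption in the standard $n$-step form, and `bootstrap_value` is a hypothetical input standing in for that estimate.

```python
def longest_n_step_targets(rewards, success, bootstrap_value=0.0, gamma=0.99):
    """Return one Q-value target per time step of a finished episode.

    target_t = sum_{k=t}^{T-1} gamma^(k-t) * r_k + gamma^(T-t) * tail,
    where T = len(rewards) and tail is 0 for a successful episode.
    For an unsuccessful episode, tail is the critic's value estimate at the
    final state (assumption: supplied by the caller as bootstrap_value).
    """
    tail = 0.0 if success else bootstrap_value
    targets = [0.0] * len(rewards)
    running = tail
    # Walk backwards so each target reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets
```

For example, with rewards `[0, 0, 1]`, `success=True`, and `gamma=0.5`, the goal reward reaches the first transition in a single update, whereas a one-step target would need several updates to carry it back that far.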

Paper Structure (24 sections, 4 theorems, 18 equations, 13 figures, 5 tables, 3 algorithms)

Introduction
Background
Deep Deterministic Policy Gradient (DDPG)
Locality-Sensitive Hashing
The ETGL-DDPG Method
$\epsilon t$-Greedy: Exploration With Search
GDRB: Goal-conditioned Dual Replay Buffer
Using Longest $n$-step Return
Experiments
Overall Performance of ETGL-DDPG
Environment Coverage through Exploration
Effectiveness of Each New Component in ETGL-DDPG
Related Work
Conclusions and Future Work
Appendix
...and 9 more sections

Key Result

Theorem 1

Given a tree $\mathcal{X}$ with $N$ nodes ($s_1$ to $s_N$), for any $\omega \in \Omega_{\mathcal{X}}$, the sampling probability satisfies: [...], if $N \le \frac{\log(|\mathcal{S}||\mathcal{A}|)}{\log\log(|\mathcal{S}||\mathcal{A}|)}$. Here, $\mathcal{S}$ and $\mathcal{A}$ represent the state space and action space, respectively.
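To get a feel for the condition on $N$ in Theorem 1, the short sketch below evaluates $\log(|\mathcal{S}||\mathcal{A}|) / \log\log(|\mathcal{S}||\mathcal{A}|)$ for finite sizes. The function name `max_tree_size` is hypothetical, and the natural logarithm is an assumption, since the statement above does not fix the base.

```python
import math


def max_tree_size(num_states, num_actions):
    """Evaluate log(|S||A|) / log(log(|S||A|)) using natural logarithms.

    The base of the logarithm is an assumption; the theorem statement
    does not specify it.
    """
    x = num_states * num_actions
    if x <= math.e:
        # log(log(x)) is not positive here, so the ratio is not meaningful.
        raise ValueError("|S||A| must exceed e for the bound to be defined")
    return math.log(x) / math.log(math.log(x))
```

With 1,000 states and 10 actions the value is a little above 4, and it grows slowly as the state-action space grows, so under this reading the condition covers only trees that are very small relative to $|\mathcal{S}||\mathcal{A}|$.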

Figures (13)

Figure 1: (a) The $\epsilon t$-greedy exploration strategy. With probability $\epsilon$, the agent builds a tree rooted at the current state $s_{t}$; otherwise, it samples the next action from its policy, $a_{t}\sim \pi$. The tree uses a hash function $\phi$ to estimate state visit counts. If a newly added node $s_{x}$ lies in an unvisited area, i.e. $n(\phi(s_{x}))=0$, the path from the root to that node is returned as option $O$. The tree helps the agent avoid obstacles, discover unexplored areas, and stay away from highly visited regions (middle red area). (b) GDRB and the longest $n$-step return for Q-value updates. $\tau_{1}$ reaches the goal (a successful episode), and $\tau_{2}$ is truncated by the time limit (an unsuccessful episode). The first buffer $D_{\beta}$ stores both trajectories, whereas $D_{e}$ stores only successful trajectories. The target Q-value for state $s_{t}$ is shown for both trajectories below the figure. In successful episodes, the target Q-value is the episode return. $s_T$ denotes the last state of each episode; in $\tau_1$ it is the goal state, indicated by a star.
Figure 2: The environments used in our experiments.
Figure 3: The success rates across all environments, averaged over 5 runs with different random seeds. Shaded areas represent one standard deviation. We trained all methods for 6 million frames in the navigation environments and 2 million frames in the manipulation environments, with success rates reported at every $10^{5}$-step checkpoint. A moving average with a window size of 10 is applied to all methods for better readability.
Figure 4: The environment coverage of the exploration strategies in the navigation environments. The y-axis shows the portion of the environment that has been covered, and the x-axis shows checkpoints taken every $10^{4}$ steps. Results are averaged over 10 runs with different random seeds. The shaded region represents one standard deviation.
Figure 5: The distribution of options chosen in training. The x-axis represents the length of the options and the y-axis indicates the probability of each length, calculated based on how often each length is chosen across all options.
...and 8 more figures
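The exploration step described in the Figure 1(a) caption can be sketched as follows. This is an illustration under several assumptions, not the authors' procedure: `grid_hash` is a simple grid discretization standing in for the hash $\phi$, `step` is a hypothetical deterministic transition function that returns `None` for a blocked move, node expansion picks a random tree node and a random action, and the fallback when the node budget runs out (return the path to the least-visited node found) is also assumed.

```python
import random


def grid_hash(state, cell=1.0):
    """Stand-in for phi (assumption): bucket a continuous state into a grid cell."""
    return tuple(int(x // cell) for x in state)


def build_option(root, step, sample_action, counts, budget, rng):
    """Grow a tree from `root` and return a list of actions (an option).

    Returns the path to the first new node whose hashed cell has visit
    count 0. If the budget runs out first, returns the path to the
    least-visited non-root node (assumed fallback).
    """
    nodes = [(root, [])]  # each entry: (state, actions from the root)
    for _ in range(budget):
        state, path = rng.choice(nodes)
        action = sample_action(rng)
        child = step(state, action)
        if child is None:  # hypothetical convention for a blocked move
            continue
        child_path = path + [action]
        if counts.get(grid_hash(child), 0) == 0:
            return child_path
        nodes.append((child, child_path))
    candidates = nodes[1:]
    if not candidates:
        return [sample_action(rng)]
    return min(candidates, key=lambda n: counts.get(grid_hash(n[0]), 0))[1]


def select_option(state, policy, step, sample_action, counts, epsilon, budget, rng):
    """With probability epsilon search for an option; otherwise follow the policy."""
    if rng.random() < epsilon:
        return build_option(state, step, sample_action, counts, budget, rng)
    return [policy(state)]
```

In a one-dimensional toy world where `step` adds the action to the position and only the starting cell has been visited, the first expansion already lands in an unvisited cell, so the returned option has length one. Longer options arise when the area around the agent has already been visited.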

Theorems & Definitions (9)

Definition 1: Covering Length
Definition 2: $\epsilon$-optimal Policy
Definition 3: PAC-MDP Algorithm
Theorem 1: Worst-Case Sampling
Theorem 2: $\epsilon t$-greedy Sample Efficiency
Theorem 3: Worst-Case Sampling
proof
Theorem 4: $\epsilon t$-greedy Sample Efficiency
proof
