ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control

Ehsan Futuhi, Shayan Karimi, Chao Gao, Martin Müller

TL;DR

ETGL-DDPG tackles sparse-reward reinforcement learning in continuous control by integrating three complementary techniques into DDPG: $\epsilon t$-greedy exploration with tree-search guided by state visitation counts, a Goal-conditioned Dual Replay Buffer (GDRB) that separates all experiences from successful ones with adaptive sampling, and longest $n$-step returns to propagate rewards quickly along successful trajectories. The authors prove polynomial (PAC-MDP) sample complexity for the $\epsilon t$-greedy component under mild assumptions and demonstrate superior empirical performance across a suite of sparse-reward navigation and manipulation tasks, with ablations confirming the contribution of each component. They further analyze environment coverage and reward propagation, showing directed exploration and dual-buffer learning yield more data-efficient training in sparse settings. Limitations include hashing-based visit counts and deterministic-domain focus, with future work on dynamic hashing and extending the approach to stochastic environments.

Abstract

We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, $\epsilon t$-greedy, which generates exploratory options for exploring less-visited states. We prove that search using $\epsilon t$-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, GDRB, and implement longest $n$-step returns. The resulting algorithm, ETGL-DDPG, integrates all three techniques ($\epsilon t$-greedy, GDRB, and Longest $n$-step) into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.
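
As a rough illustration of the two reward-propagation ideas named in the abstract, the sketch below computes longest $n$-step targets for one episode and keeps two buffers in the spirit of GDRB. It is a minimal sketch, not the authors' implementation. The names `longest_n_step_targets`, `DualReplayBuffer`, `bootstrap_value`, and `success_fraction` are hypothetical. The bootstrap term for truncated episodes and the caller-supplied sampling ratio are assumptions, since this summary does not specify either.

```python
import random
from collections import deque


def longest_n_step_targets(rewards, success, gamma=0.99, bootstrap_value=0.0):
    """Return one Q-value target per time step of a single episode.

    Successful episode: the target at step t is the discounted return from t
    to the end of the episode (no bootstrapping).
    Unsuccessful episode: the same sum plus a discounted bootstrap from the
    last state (assumption; bootstrap_value stands in for a target critic).
    """
    tail = 0.0 if success else bootstrap_value
    targets = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        tail = rewards[t] + gamma * tail
        targets[t] = tail
    return targets


class DualReplayBuffer:
    """Two buffers: one for all episodes, one for successful episodes only."""

    def __init__(self, capacity=100_000):
        self.all_buffer = deque(maxlen=capacity)      # plays the role of D_beta
        self.success_buffer = deque(maxlen=capacity)  # plays the role of D_e

    def add_episode(self, transitions, success):
        self.all_buffer.extend(transitions)
        if success:
            self.success_buffer.extend(transitions)

    def sample(self, batch_size, success_fraction, rng=random):
        """Mix samples from both buffers (success_fraction is hypothetical)."""
        n_success = int(round(batch_size * success_fraction))
        if not self.success_buffer:
            n_success = 0  # nothing successful yet: draw everything from all_buffer
        batch = []
        if n_success > 0:
            batch += rng.choices(list(self.success_buffer), k=n_success)
        if batch_size - n_success > 0:
            batch += rng.choices(list(self.all_buffer), k=batch_size - n_success)
        return batch
```

With a reward only at the goal, every step of a successful episode receives a non-zero target in a single pass, which is the sense in which the longest return propagates rewards quickly along the trajectory.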

Paper Structure (24 sections, 4 theorems, 18 equations, 13 figures, 5 tables, 3 algorithms)

Key Result

Theorem 1

Given a tree $\mathcal{X}$ with $N$ nodes ($s_1$ to $s_N$), for any $\omega \in \Omega_{\mathcal{X}}$, the sampling probability satisfies: , if $N \le \frac{\log(|\mathcal{S}||\mathcal{A}|)}{\log\log(|\mathcal{S}||\mathcal{A}|)}$. Here, $\mathcal{S}$ and $\mathcal{A}$ represent the state space and action space, respectively.
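
To get a feel for how restrictive the condition on $N$ is, the following sketch evaluates the right-hand side $\frac{\log(|\mathcal{S}||\mathcal{A}|)}{\log\log(|\mathcal{S}||\mathcal{A}|)}$ for finite state and action counts. The function name `max_tree_nodes` is hypothetical, and the natural logarithm is an assumption, since the base is not stated here.

```python
import math


def max_tree_nodes(num_states, num_actions):
    """Evaluate log(|S||A|) / log(log(|S||A|)) using natural logs (assumed base)."""
    x = num_states * num_actions
    if x <= math.e:
        # log(log(x)) is only positive when x > e
        raise ValueError("need |S||A| > e for the bound to be defined and positive")
    return math.log(x) / math.log(math.log(x))


if __name__ == "__main__":
    for n_states in (10**3, 10**6, 10**9):
        print(n_states, round(max_tree_nodes(n_states, 10), 2))
```

The bound grows very slowly: under these assumptions, a million states and ten actions allow a tree of fewer than six nodes.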

Figures (13)

  • Figure 1: (a): $\epsilon t$-greedy exploration strategy. With probability $\epsilon$, the agent creates a tree from the current state $s_{t}$. Otherwise, it uses its policy to determine the next action $a_{t}\sim \pi$. The tree uses a hash function $\phi$ to estimate the visit counts to states. If a node $s_{x}$ newly added to the tree lies in an unvisited area, i.e., $n(\phi(s_{x}))=0$, the path from the root to that node is returned as option $O$. The tree helps in avoiding obstacles, discovering unexplored areas, and staying away from highly visited regions (middle red area). (b): GDRB and the longest $n$-step return for Q-value updates. $\tau_{1}$ reaches the goal (a successful episode), and $\tau_{2}$ is truncated by the time limit (an unsuccessful episode). The first buffer $D_{\beta}$ stores both trajectories, whereas $D_{e}$ stores only successful trajectories. The target Q-value for state $s_{t}$ is shown for both trajectories below the figure. In successful episodes, the target Q-value is the episode return. $s_T$ represents the last state in each episode, which is the goal state indicated by a star in $\tau_1$.
  • Figure 2: The environments used in our experiments.
  • Figure 3: The success rates across all environments, averaged over 5 runs with different random seeds. Shaded areas represent one standard deviation. We trained all methods for 6 million frames in the navigation environments and 2 million frames in the manipulation environments, with success rates reported at every $10^{5}$-step checkpoint. A moving average with a window size of 10 is applied to all methods for better readability.
  • Figure 4: The environment coverage for exploration strategies in navigation environments. The y-axis shows the portion of the environment that has been covered, and the x-axis shows checkpoints taken every $10^{4}$ steps. Results are averaged over 10 runs with random seeds. The shaded region represents one standard deviation.
  • Figure 5: The distribution of options chosen in training. The x-axis represents the length of the options and the y-axis indicates the probability of each length, calculated based on how often each length is chosen across all options.
  • ...and 8 more figures
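
The exploration step described in the Figure 1(a) caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's algorithm. The 2-D state, the grid hash `phi`, the deterministic `step` model, the discrete `ACTIONS` set, uniform random node expansion, and the fallback to the least-visited node when no unvisited node is found are all hypothetical choices made for the example.

```python
import random

# Hypothetical discrete action set for a toy 2-D point agent.
ACTIONS = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]


def phi(state, cell=1.0):
    """Hypothetical hash: bucket a 2-D state into a grid cell."""
    return (int(state[0] // cell), int(state[1] // cell))


def step(state, action):
    """Hypothetical deterministic transition model used to expand the tree."""
    return (state[0] + action[0], state[1] + action[1])


def path_to(index, parent):
    """Collect the actions leading from the root to the node at `index`."""
    actions = []
    while parent[index] is not None:
        index, action = parent[index]
        actions.append(action)
    actions.reverse()
    return actions


def build_option(root, counts, max_nodes=50, rng=random):
    """Grow a tree from `root`; return the action path to an unvisited node.

    `counts` maps hashed states to visit counts. If the node budget runs out,
    fall back to the least-visited node in the tree (assumption).
    """
    if max_nodes < 2:
        raise ValueError("max_nodes must be at least 2")
    nodes = [root]
    parent = {0: None}  # node index -> (parent index, action) or None for root
    for _ in range(max_nodes - 1):
        i = rng.randrange(len(nodes))        # pick an existing node to expand
        action = rng.choice(ACTIONS)
        s_x = step(nodes[i], action)
        nodes.append(s_x)
        parent[len(nodes) - 1] = (i, action)
        if counts.get(phi(s_x), 0) == 0:     # n(phi(s_x)) == 0: unvisited area
            return path_to(len(nodes) - 1, parent)
    best = min(range(1, len(nodes)), key=lambda j: counts.get(phi(nodes[j]), 0))
    return path_to(best, parent)


def select_actions(state, policy, counts, epsilon=0.1, rng=random):
    """With probability epsilon return an exploratory option, else one policy action."""
    if rng.random() < epsilon:
        return build_option(state, counts, rng=rng)
    return [policy(state)]
```

A caller would execute the returned actions in order and increment `counts[phi(s)]` for each state reached, so that later trees steer away from regions that have already been visited.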

Theorems & Definitions (9)

  • Definition 1: Covering Length
  • Definition 2: $\epsilon$-optimal Policy
  • Definition 3: PAC-MDP Algorithm
  • Theorem 1: Worst-Case Sampling
  • Theorem 2: $\epsilon t$-greedy Sample Efficiency
  • Theorem 3: Worst-Case Sampling
  • proof
  • Theorem 4: $\epsilon t$-greedy Sample Efficiency
  • proof