ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control
Ehsan Futuhi, Shayan Karimi, Chao Gao, Martin Müller
TL;DR
ETGL-DDPG tackles sparse-reward reinforcement learning in continuous control by integrating three complementary techniques into DDPG: $εt$-greedy exploration, a tree search guided by state visitation counts; a Goal-conditioned Dual Replay Buffer (GDRB), which keeps one buffer of all experiences and another of successful ones and samples from the two adaptively; and longest $n$-step returns, which propagate rewards quickly along successful trajectories. The authors prove polynomial (PAC-MDP) sample complexity for the $εt$-greedy component under mild assumptions and demonstrate superior empirical performance across a suite of sparse-reward navigation and manipulation tasks, with ablations confirming the contribution of each component. They further analyze environment coverage and reward propagation, showing that directed exploration and dual-buffer learning yield more data-efficient training in sparse settings. Limitations include the reliance on hashing-based visit counts and a focus on deterministic domains; future work targets dynamic hashing and an extension to stochastic environments.
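To make the dual-buffer idea concrete, here is a minimal Python sketch of a replay buffer that keeps all transitions in one store and transitions from successful episodes in a second one. It is an illustration only, not the authors' implementation: the class name `DualReplayBuffer`, the `success_fraction` argument, and sampling with replacement are assumptions, and the schedule by which GDRB adapts the sampling ratio is not specified here, so the caller supplies the ratio directly.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Hypothetical sketch: one buffer for all transitions, one for successes."""

    def __init__(self, capacity, seed=None):
        # Both buffers drop their oldest entries once capacity is reached.
        self.all_buffer = deque(maxlen=capacity)
        self.success_buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add_episode(self, transitions, reached_goal):
        # Every transition goes into the first buffer.
        self.all_buffer.extend(transitions)
        # Transitions from goal-reaching episodes are also kept separately.
        if reached_goal:
            self.success_buffer.extend(transitions)

    def sample(self, batch_size, success_fraction):
        # success_fraction is an assumed knob; an adaptive scheme would
        # change it over training. Sampling is with replacement.
        if not self.all_buffer:
            raise ValueError("cannot sample from an empty buffer")
        n_success = int(round(batch_size * success_fraction))
        if not self.success_buffer:
            n_success = 0  # fall back to the first buffer only
        batch = self.rng.choices(self.success_buffer, k=n_success) if n_success else []
        batch += self.rng.choices(self.all_buffer, k=batch_size - n_success)
        return batch
```

Calling `sample(batch_size, success_fraction)` with a larger fraction biases each batch toward rewarded experience, which is the property the dual-buffer design relies on in sparse-reward settings.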
Abstract
We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, $εt$-greedy, which generates exploratory options for exploring less-visited states. We prove that search using $εt$-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, GDRB, and implement longest $n$-step returns. The resulting algorithm, ETGL-DDPG, integrates all three techniques, $εt$-greedy (ET), GDRB (G), and longest $n$-step returns (L), into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.
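The effect of longest $n$-step returns can be illustrated with a short Python sketch. It assumes, as one plausible reading, that the target for each step of an episode accumulates all discounted rewards up to the end of that episode, with an optional bootstrap value for episodes cut off before termination. The function name `longest_n_step_targets` and the `bootstrap_value` parameter are hypothetical and are not taken from the paper.

```python
def longest_n_step_targets(rewards, gamma, bootstrap_value=0.0):
    """Return, for each step t, the discounted sum of rewards from t to episode end.

    bootstrap_value stands in for a critic estimate at the final state when
    the episode was truncated; use 0.0 when the episode truly terminated.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    # Sweep backwards so each target reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


# A three-step episode whose only reward arrives at the goal.
print(longest_n_step_targets([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

In this example every step of the successful episode receives a nonzero target after a single pass, whereas a one-step target would initially inform only the final transition.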
