State-Novelty Guided Action Persistence in Deep Reinforcement Learning
Jianshu Hu, Paul Weng, Yutong Ban
TL;DR
This work addresses sample inefficiency in deep reinforcement learning by tackling the exploration-exploitation trade-off with adaptive action persistence. It introduces SNAP, a state-novelty guided persistence adaptor that dynamically adjusts the probability of repeating the last action without training extra value functions or policies. The repeat probability is defined as $P(\pi'(s_t)=a_{t-1}) = \frac{\alpha}{\max(1, \sqrt{\tilde{N}(s_t)})}$, where $\tilde{N}(s_t)$ is a pseudo-count obtained via SimHash on image features. The probability therefore equals the coefficient $\alpha$ while $\tilde{N}(s_t) \le 1$ and decays as $1/\sqrt{\tilde{N}(s_t)}$ once a state has been counted more often. Evaluations on DMControl tasks using DrQv2 as the base algorithm show that SNAP improves sample efficiency and exploration, especially when combined with common exploration strategies such as $\epsilon$-greedy, entropy regularization (SAC), and NoisyNet. The method reduces computational overhead by avoiding extra learned components and provides a principled, gradual transition from temporally persistent exploration to fine-grained control as training progresses.
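The repeat-probability rule above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `SimHashCounter`, the functions `repeat_probability` and `select_action`, the random Gaussian projection used for hashing, the number of hash bits, and the default `alpha=0.5` are all assumptions chosen for illustration. The summary also does not say whether the pseudo-count is read before or after the current visit is recorded, so the sketch leaves that choice to the caller.

```python
import math
import random
from collections import defaultdict


class SimHashCounter:
    """Hypothetical SimHash-style pseudo-counter over feature vectors."""

    def __init__(self, feature_dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # Fixed random Gaussian projection (an assumption, not from the source).
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
            for _ in range(n_bits)
        ]
        self.counts = defaultdict(int)

    def _key(self, features):
        # The sign of each projection gives one bit of the hash code.
        return tuple(
            sum(w * x for w, x in zip(row, features)) > 0.0
            for row in self.projection
        )

    def update(self, features):
        """Record one visit and return the updated pseudo-count."""
        key = self._key(features)
        self.counts[key] += 1
        return self.counts[key]

    def count(self, features):
        """Return the current pseudo-count without recording a visit."""
        return self.counts.get(self._key(features), 0)


def repeat_probability(pseudo_count, alpha=0.5):
    """alpha / max(1, sqrt(N)), following the formula in the summary."""
    return alpha / max(1.0, math.sqrt(pseudo_count))


def select_action(policy_action, last_action, pseudo_count, rng, alpha=0.5):
    """Repeat the last action with the SNAP-style probability, else follow the policy."""
    p_repeat = repeat_probability(pseudo_count, alpha)
    if last_action is not None and rng.random() < p_repeat:
        return last_action
    return policy_action
```

In this sketch a state whose hash bucket has been counted four times is repeated with probability `alpha / 2`, so persistence fades gradually as states become familiar. The action returned by the base exploration strategy (for example $\epsilon$-greedy) would be passed in as `policy_action`.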
Abstract
While a powerful and promising approach, deep reinforcement learning (DRL) still suffers from sample inefficiency, which can be notably improved by resorting to more sophisticated techniques to address the exploration-exploitation dilemma. One such technique relies on action persistence (i.e., repeating an action over multiple steps). However, previous work exploiting action persistence either applies a fixed strategy or learns additional value functions (or policies) for selecting the number of repetitions. In this paper, we propose a novel method to dynamically adjust the action persistence based on the current exploration status of the state space. As a result, our method does not require training additional value functions or policies. Moreover, smoothly scheduling the repeat probability allows a more effective balance between exploration and exploitation. Furthermore, our method can be seamlessly integrated into various basic exploration strategies to incorporate temporal persistence. Finally, extensive experiments on different DMControl tasks demonstrate that our state-novelty guided action persistence method significantly improves the sample efficiency.
