State-Novelty Guided Action Persistence in Deep Reinforcement Learning

Jianshu Hu, Paul Weng, Yutong Ban

TL;DR

This work addresses sample inefficiency in deep reinforcement learning by tackling the exploration-exploitation trade-off with adaptive action persistence. It introduces SNAP, a state-novelty guided persistence adaptor that dynamically adjusts the probability of repeating the last action, without training extra value functions or policies; the probability is defined as $P(\pi'(s_t)=a_{t-1}) = \frac{\alpha}{\max(1, \sqrt{\tilde{N}(s_t)})}$, where $\tilde{N}(s_t)$ is a pseudo-count obtained via simhash on image features. Evaluations on DMControl tasks using DrQv2 as the base algorithm show that SNAP improves sample efficiency and exploration, especially when combined with common exploration strategies like $\epsilon$-greedy, entropy regularization (SAC), and NoisyNet. The method reduces computational overhead by avoiding extra learned components and provides a principled, gradual transition from temporally persistent exploration to fine-grained control as training progresses.
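The repeat rule above can be illustrated with a minimal Python sketch. It assumes $\alpha \in [0, 1]$ so that the expression is a valid probability, and it takes the pseudo-count $\tilde{N}(s_t)$ as a plain number. The function names `repeat_probability` and `persistent_action` are hypothetical and are not taken from the authors' code.

```python
import math
import random


def repeat_probability(pseudo_count: float, alpha: float = 1.0) -> float:
    """Return alpha / max(1, sqrt(N~(s_t))), the probability of repeating a_{t-1}.

    Assumption: alpha lies in [0, 1], so the result is a valid probability.
    """
    return alpha / max(1.0, math.sqrt(pseudo_count))


def persistent_action(policy_action, last_action, pseudo_count, alpha=1.0, rng=random):
    """Pick the behavior action: repeat the last action or follow the base policy.

    `policy_action` is whatever the base exploration strategy proposed for s_t.
    `rng` only needs a `random()` method returning a float in [0, 1).
    """
    if last_action is not None and rng.random() < repeat_probability(pseudo_count, alpha):
        return last_action  # novel state: keep the previous action
    return policy_action    # well-visited state: fall back to fine-grained control


if __name__ == "__main__":
    # A rarely visited state repeats often; a frequently visited one rarely does.
    for n in (0, 4, 100):
        print(n, repeat_probability(n, alpha=0.8))
```

Under this rule a state with a pseudo-count of at most 1 repeats the last action with probability $\alpha$, and the probability decays as $1/\sqrt{\tilde{N}(s_t)}$ once the state has been visited more often, which matches the gradual shift toward fine-grained control described above.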

Abstract

While a powerful and promising approach, deep reinforcement learning (DRL) still suffers from sample inefficiency, which can be notably improved by resorting to more sophisticated techniques to address the exploration-exploitation dilemma. One such technique relies on action persistence (i.e., repeating an action over multiple steps). However, previous work exploiting action persistence either applies a fixed strategy or learns additional value functions (or policies) for selecting the number of repetitions. In this paper, we propose a novel method to dynamically adjust the action persistence based on the current exploration status of the state space. As a result, our method does not require training additional value functions or policies. Moreover, a smooth schedule for the repeat probability allows a more effective balance between exploration and exploitation. Furthermore, our method can be seamlessly integrated into various basic exploration strategies to incorporate temporal persistence. Finally, extensive experiments on different DMControl tasks demonstrate that our state-novelty guided action persistence method significantly improves sample efficiency.

Paper Structure (20 sections, 8 equations, 10 figures, 3 tables, 1 algorithm)

Figures (10)

  • Figure 1: SNAP incorporates dynamic action persistences into the behavior policy of an off-policy DRL algorithm. In the initial stage, behavior policies with large action persistences make low-frequency decisions, which ensures temporally persistent exploration. In later stages, fine-grained policies with small action persistences allow high-frequency actions, which are necessary for superior overall performance.
  • Figure 2: State coverage when executing different policies in the mini grid. A deeper color in the grid corresponds to a larger probability of visiting the state. The episode length and total number of time steps in one run are set to (20, 1000) for the results in the first row and (100, 3000) for the results in the second row. The percentage of state coverage averaged over 30 runs is indicated in the title of each plot.
  • Figure 3: Overview. (a) An additional action persistence adaptor, guided by state novelty, is integrated into an off-policy DRL framework. (b) Measuring novelty directly from image inputs used as states is challenging. To address this, an image encoder converts the images into feature vectors, which are then mapped to binary codes by a quantization encoder. By counting the binary codes, we approximate the state distribution of the training data used for the actor or actor-critic. This approximation guides the action persistence in the behavior policy.
  • Figure 4: DrQv2 with different action persistences. The first row compares SNAP with using a fixed action persistence. The sample efficiency of our method cannot be achieved by simply tuning the action persistence as a hyperparameter. The second row shows the probabilities of repeating actions, averaged over 1000 frames. Initially, a high probability of repeating actions enables temporally persistent exploration. As training progresses, this probability decreases, ensuring a fine-grained policy is employed in the later stages.
  • Figure 5: Performance of incorporating repeated actions. Two ways of incorporating temporally persistent exploration are compared with the baseline. Epsilon-zeta simply uses a fixed zeta distribution to decide the number of time steps for which a random action is repeated. Our method dynamically decides whether to repeat actions based on state novelty.
  • ...and 5 more figures
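
The counting pipeline in the Figure 3 caption (feature vector, binary code, count) can be sketched roughly as follows. This is an illustrative SimHash-style counter, not the authors' implementation. The random Gaussian projection, the 16-bit code length, and the names `SimHashCounter`, `code`, `update`, and `count` are all assumptions, and the image encoder is replaced by a plain list of floats.

```python
import random
from collections import Counter


class SimHashCounter:
    """Count visits per binary code, where code = sign(A @ feature).

    Assumptions: A is a fixed random Gaussian matrix, and `feature` is the
    output of some image encoder, passed in here as a sequence of floats.
    """

    def __init__(self, feature_dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # One random projection row per bit of the binary code.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
            for _ in range(n_bits)
        ]
        self.counts = Counter()

    def code(self, feature):
        """Map a feature vector to a hashable tuple of booleans."""
        return tuple(
            sum(w * x for w, x in zip(row, feature)) > 0.0
            for row in self.projection
        )

    def update(self, feature):
        """Record one visit to the code of this feature vector."""
        self.counts[self.code(feature)] += 1

    def count(self, feature):
        """Return the pseudo-count of this feature vector (0 if unseen)."""
        return self.counts[self.code(feature)]


if __name__ == "__main__":
    counter = SimHashCounter(feature_dim=8)
    feature = [1.0] * 8
    counter.update(feature)
    counter.update(feature)
    print(counter.count(feature))
```

Because only the signs of the projections are kept, feature vectors that point in similar directions tend to share a code and therefore a count. Shorter codes merge more states into each bucket, and longer codes keep them apart.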