State-Novelty Guided Action Persistence in Deep Reinforcement Learning

Jianshu Hu; Paul Weng; Yutong Ban

TL;DR

This work addresses sample inefficiency in deep reinforcement learning by tackling the exploration-exploitation trade-off with adaptive action persistence. It introduces SNAP, a state-novelty guided persistence adaptor that dynamically adjusts the probability of repeating the last action, without training extra value functions or policies; the probability is defined as $P(\pi'(s_t)=a_{t-1}) = \frac{\alpha}{\max(1, \sqrt{\tilde{N}(s_t)})}$, where $\tilde{N}(s_t)$ is a pseudo-count obtained via simhash on image features. Evaluations on DMControl tasks using DrQv2 as the base algorithm show that SNAP improves sample efficiency and exploration, especially when combined with common exploration strategies like $\epsilon$-greedy, entropy regularization (SAC), and NoisyNet. The method reduces computational overhead by avoiding extra learned components and provides a principled, gradual transition from temporally persistent exploration to fine-grained control as training progresses.
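The repeat-probability rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `repeat_probability` and `snap_action` are hypothetical, `alpha` is assumed to lie in [0, 1] so that the result is a valid probability, and the pseudo-count is assumed to be supplied by some external counter.

```python
import math
import random


def repeat_probability(pseudo_count: float, alpha: float) -> float:
    """Probability of repeating the last action: alpha / max(1, sqrt(N~(s_t))).

    Assumption: alpha is in [0, 1], so the result is a valid probability.
    """
    return alpha / max(1.0, math.sqrt(pseudo_count))


def snap_action(policy_action, last_action, pseudo_count, alpha, rng=random):
    """Hypothetical behavior-policy wrapper around a base policy's action.

    With probability repeat_probability(...), the previous action is reused;
    otherwise the action proposed by the base policy is returned.
    """
    if last_action is not None:
        if rng.random() < repeat_probability(pseudo_count, alpha):
            return last_action
    return policy_action
```

Under this rule a state with pseudo-count 0 or 1 repeats the last action with probability `alpha`, while a state with pseudo-count 100 repeats it with probability `alpha / 10`, so persistence fades as states become familiar.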

Abstract

While a powerful and promising approach, deep reinforcement learning (DRL) still suffers from sample inefficiency, which can be notably improved by resorting to more sophisticated techniques to address the exploration-exploitation dilemma. One such technique relies on action persistence (i.e., repeating an action over multiple steps). However, previous work exploiting action persistence either applies a fixed strategy or learns additional value functions (or policies) for selecting the number of repetitions. In this paper, we propose a novel method to dynamically adjust the action persistence based on the current exploration status of the state space. As a result, our method does not require training additional value functions or policies. Moreover, the smooth scheduling of the repeat probability allows a more effective balance between exploration and exploitation. Furthermore, our method can be seamlessly integrated into various basic exploration strategies to incorporate temporal persistence. Finally, extensive experiments on different DMControl tasks demonstrate that our state-novelty guided action persistence method significantly improves sample efficiency.

Paper Structure (20 sections, 8 equations, 10 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Background
Problem Formulation
Actor-Critic Framework
Effect of Action Persistence
Large Action Persistence
Small Action Persistence
Methodology
SNAP Framework
Persistence Adaptor
Algorithm
Experimental Results
Different Action Persistences
Exploiting Repeated Actions
...and 5 more sections

Figures (10)

Figure 1: SNAP incorporates dynamic action persistences into the behavior policy of an off-policy DRL algorithm. In the initial stage, behavior policies with large action persistences make low-frequency decisions, which ensures temporally persistent exploration. In later stages, fine-grained policies with small action persistences allow high-frequency actions, which are necessary for superior overall performance.
Figure 2: State coverage when executing different policies in the mini grid. A deeper color in the grid corresponds to a higher probability of visiting the state. The episode length and total number of time steps in one run are set to (20, 1000) for the results in the first row and (100, 3000) for the results in the second row. The percentage of state coverage averaged over 30 runs is indicated in the title of each plot.
Figure 3: Overview. (a) An additional action persistence adaptor, guided by state novelty, is integrated into an off-policy DRL framework. (b) Measuring novelty directly from image inputs used as states is challenging. To address this, an image encoder converts the images into feature vectors, which are then mapped to binary codes by a quantization encoder. By counting the binary codes, we approximate the state distribution of the training data used for the actor or actor-critic. This count guides the action persistence in the behavior policy.
Figure 4: DrQv2 with different action persistences. The first row compares SNAP with fixed action persistences. The sample efficiency of our method cannot be achieved by simply tuning the action persistence as a hyperparameter. The second row shows the repeat probabilities averaged over 1000 frames. Initially, a high probability of repeating actions enables temporally persistent exploration. As training progresses, this probability decreases, ensuring that a fine-grained policy is employed in the later stages.
Figure 5: Performance of incorporating repeated actions. Two ways of incorporating temporally persistent exploration are compared with the baseline. Epsilon-zeta simply uses a fixed zeta distribution to decide the number of time steps for which random actions are repeated. Our method dynamically decides whether to repeat actions based on state novelty.
...and 5 more figures
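
The counting step described in the Figure 3(b) caption can be illustrated with a small SimHash-style counter. This is a hedged sketch, not the paper's code: the class name `SimHashCounter`, the random Gaussian projection, the 16-bit code length, and the use of plain Python lists as feature vectors are all assumptions made for illustration. The image encoder that would produce the features is omitted.

```python
import random
from collections import Counter


class SimHashCounter:
    """Hypothetical pseudo-count table keyed by binary SimHash codes."""

    def __init__(self, feature_dim, code_bits=16, seed=0):
        rng = random.Random(seed)
        # Fixed random Gaussian projection: one row per output bit.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
            for _ in range(code_bits)
        ]
        self.counts = Counter()

    def code(self, feature):
        """Map a feature vector to a binary code via signs of projections."""
        return tuple(
            1 if sum(w * x for w, x in zip(row, feature)) >= 0.0 else 0
            for row in self.projection
        )

    def update(self, feature):
        """Increment and return the count of the feature's code."""
        key = self.code(feature)
        self.counts[key] += 1
        return self.counts[key]

    def count(self, feature):
        """Return the current pseudo-count without modifying the table."""
        return self.counts[self.code(feature)]
```

Because only the sign of each projection is kept, feature vectors that point in similar directions tend to share a code and therefore a count, which is what lets a discrete table stand in for visitation counts over image observations.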
