Exploration and Anti-Exploration with Distributional Random Network Distillation

Kai Yang; Jian Tao; Jiafei Lyu; Xiu Li

TL;DR

This paper tackles bonus inconsistency in Random Network Distillation (RND) by introducing Distributional RND (DRND), which distills a distribution over multiple fixed random targets to produce more discriminative intrinsic rewards. DRND defines two complementary bonuses, $b_1(x)=\|f_\theta(x)-\mu(x)\|^2$ and $b_2(x)=\sqrt{\frac{[f_\theta(x)]^2-\mu(x)^2}{B_2(x)-\mu(x)^2}}$, and combines them as $b(x)=\alpha b_1(x)+(1-\alpha) b_2(x)$, where $\mu(x)$ and $B_2(x)$ are the first two moments of the target distribution; the predictor $f_\theta$ learns to approximate the distributional variable $c(x)$ with $L(\theta)=\|f_\theta(x)-c(x)\|^2$. The authors show that the DRND predictor acts as a pseudo-count estimator, with an unbiased statistic $y(x)=\frac{f_*^2(x)-\mu(x)^2}{B_2(x)-\mu(x)^2}$ for $1/n$ and vanishing variance as data accumulate. Empirically, DRND improves exploration in online Atari (Montezuma's Revenge, Gravitar, Venture), Adroit, and Fetch tasks, and enhances offline performance in D4RL with SAC-DRND, outperforming several baselines while remaining computationally efficient. Overall, DRND provides a principled, distributional extension to RND that couples exploration with pseudo-count dynamics, yielding robust improvements across online and offline RL settings.
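The bonus formulas and the pseudo-count statistic above can be illustrated with a minimal sketch for a single state with scalar network outputs. The function names, the clipping of the ratio to $[0, 1]$, and the example values are assumptions made for illustration, not the authors' implementation. The simulation models the optimal predictor after $n$ visits as the average of $n$ targets drawn uniformly from the fixed set, which is what minimizing $L(\theta)$ yields in this idealized setting.

```python
import math
import random


def target_moments(targets):
    """Return (mu, B2): mean and second moment of the fixed target outputs at one state."""
    count = len(targets)
    mu = sum(targets) / count
    b2 = sum(t * t for t in targets) / count
    return mu, b2


def pseudo_count_statistic(pred, targets):
    """y(x) = (f(x)^2 - mu(x)^2) / (B2(x) - mu(x)^2)."""
    mu, b2 = target_moments(targets)
    return (pred * pred - mu * mu) / (b2 - mu * mu)


def drnd_bonus(pred, targets, alpha):
    """b(x) = alpha * b1(x) + (1 - alpha) * b2(x) for a scalar prediction."""
    mu, _ = target_moments(targets)
    b1 = (pred - mu) ** 2
    ratio = pseudo_count_statistic(pred, targets)
    # Clipping keeps the square root well defined; this is an assumption of the sketch.
    b2_term = math.sqrt(min(max(ratio, 0.0), 1.0))
    return alpha * b1 + (1.0 - alpha) * b2_term


def mean_statistic(targets, n_visits, trials, seed=0):
    """Average y(x) over many simulated predictors that have seen the state n_visits times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Idealized optimal predictor: mean of the targets sampled on each visit.
        f_star = sum(rng.choice(targets) for _ in range(n_visits)) / n_visits
        total += pseudo_count_statistic(f_star, targets)
    return total / trials


if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0, 4.0]  # hypothetical outputs of N = 4 fixed random targets
    print(drnd_bonus(pred=0.3, targets=targets, alpha=0.5))
    for n in (1, 2, 4, 8):
        print(n, mean_statistic(targets, n_visits=n, trials=20000))
```

Under these assumptions the averaged statistic should track $1/n$ as the visit count grows, which is the sense in which the predictor behaves like a pseudo-count model.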

Abstract

Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although the prevailing exploration algorithm, Random Network Distillation (RND), has been demonstrated to be effective in numerous environments, it often lacks discriminative power in bonus allocation. This paper highlights the "bonus inconsistency" issue within RND, pinpointing its primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks. Our code is publicly available at https://github.com/yk7333/DRND.

Paper Structure

This paper contains 34 sections, 3 theorems, 22 equations, 14 figures, 18 tables, 2 algorithms.

Introduction
Related Work
Preliminaries
Method
Bonus Inconsistencies in Random Network Distillation
Distill the target network of random distribution
The DRND predictor is secretly a pseudo-count model
Bonus of the DRND agent
Connections between DRND and prior methods
Experiment
Bonus prediction comparison
Performance on Online experiments
D4RL Offline experiments
Conclusion
Proof
...and 19 more sections

Key Result

Lemma 4.1

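Let $\tilde{\theta}$ and $\bar{\theta}_i, i = 1, 2, \ldots, N$ be i.i.d. samples from $p(\theta)$. Given the linear model $f_\theta(x) = \theta^T x$, the expected mean squared error is $\mathbb{E}_{\tilde{\theta}, \bar{\theta}_i}\left[\left\|f_{\tilde{\theta}}(x) - \frac{1}{N}\sum_{i=1}^{N} f_{\bar{\theta}_i}(x)\right\|^2\right] = \left(1 + \frac{1}{N}\right) x^T \Sigma x$, where $\Sigma$ is the variance of $p(\theta)$.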
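As a rough numerical check of the linear-model setting in Lemma 4.1, the sketch below takes the error to be between the predictor output $f_{\tilde{\theta}}(x)$ and the average of the $N$ target outputs, for which a direct calculation gives $(1 + 1/N)\, x^T \Sigma x$. The Gaussian choice for $p(\theta)$, the diagonal covariance, and all function names are assumptions made for illustration.

```python
import random


def closed_form(x, variances, n_targets):
    """(1 + 1/N) * x^T Sigma x for a diagonal Sigma given by `variances`."""
    quad = sum(v * xi * xi for v, xi in zip(variances, x))
    return (1.0 + 1.0 / n_targets) * quad


def sample_theta(rng, mean, variances):
    """Draw theta from an independent Gaussian prior (an assumed form of p(theta))."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mean, variances)]


def linear(theta, x):
    """Linear model f_theta(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def monte_carlo_mse(x, mean, variances, n_targets, trials, seed=0):
    """Estimate E[(f_pred(x) - mean of N target outputs)^2] by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pred = linear(sample_theta(rng, mean, variances), x)
        avg_target = sum(
            linear(sample_theta(rng, mean, variances), x) for _ in range(n_targets)
        ) / n_targets
        total += (pred - avg_target) ** 2
    return total / trials


if __name__ == "__main__":
    x = [1.0, -2.0]
    mean = [0.5, 0.0]
    variances = [1.0, 0.25]
    for n_targets in (1, 4, 16):
        estimate = monte_carlo_mse(x, mean, variances, n_targets, trials=20000)
        print(n_targets, estimate, closed_form(x, variances, n_targets))
```

In this setting the expected error does not depend on the mean of $p(\theta)$, only on $\Sigma$, $x$, and $N$, and it approaches $x^T \Sigma x$ as $N$ grows.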

Figures (14)

Figure 1: Bonus heatmap of the dataset distribution and the RND bonus. The left image shows the dataset distribution, the middle image shows the RND bonus before training, and the right image shows the RND bonus after training. A more detailed view of how the bonus changes is given in the appendix. Ideally, the bonus should be uniform before any training and without exposure to the dataset; after extensive training, the expected bonus should correlate inversely with the dataset distribution. The bonus distribution of RND is inconsistent with the desired distribution, indicating a problem with bonus inconsistency. Details of the experimental settings are also given in the appendix.
Figure 2: Diagram of RND and DRND. Compared to the RND method that only distills a fixed target network, our method distills a randomly distributed target network and utilizes statistical metrics to assign a bonus to each state.
Figure 3: Distribution of the DRND bonus. The dataset distribution is the same as in Figure 1. These illustrations depict the distribution of the DRND bonus, including the first bonus and the second bonus. The first bonus is predominant before training, and the second bonus becomes more prominent after training.
Figure 4: Inconsistency experiments from the Experiment section. We plot the intrinsic reward distribution of RND and DRND before and after training on a mini-dataset. Left: box plot of the difference between the maximum and minimum intrinsic rewards over 10 independent runs before training. Right: the intrinsic rewards for each data point after training.
Figure 5: Mean episodic return of the DRND method, the RND method, and the PPO baseline on three Atari games. All curves are averaged over 5 runs.
...and 9 more figures

Theorems & Definitions (4)

Lemma 4.1
Lemma 4.2
Proof Sketch
Lemma 4.3
