Exploration and Anti-Exploration with Distributional Random Network Distillation

Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li

TL;DR

This paper tackles bonus inconsistency in Random Network Distillation (RND) by introducing Distributional RND (DRND), which distills a distribution over multiple fixed random targets to produce more discriminative intrinsic rewards. DRND defines two complementary bonuses, $b_1(x)=\|f_\theta(x)-\mu(x)\|^2$ and $b_2(x)=\sqrt{\frac{[f_\theta(x)]^2-\mu(x)^2}{B_2(x)-\mu(x)^2}}$, and combines them as $b(x)=\alpha b_1(x)+(1-\alpha) b_2(x)$, where $\mu(x)$ and $B_2(x)$ are the first two moments of the target distribution; the predictor $f_\theta$ learns to approximate the distributional variable $c(x)$ with $L(\theta)=\|f_\theta(x)-c(x)\|^2$. The authors show that the DRND predictor acts as a pseudo-count estimator, with an unbiased statistic $y(x)=\frac{f_*^2(x)-\mu(x)^2}{B_2(x)-\mu(x)^2}$ for $1/n$ and vanishing variance as data accumulate. Empirically, DRND improves exploration in online Atari (Montezuma's Revenge, Gravitar, Venture), Adroit, and Fetch tasks, and enhances offline performance in D4RL with SAC-DRND, outperforming several baselines while remaining computationally efficient. Overall, DRND provides a principled, distributional extension to RND that couples exploration with pseudo-count dynamics, yielding robust improvements across online and offline RL settings.
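The bonus formulas and the pseudo-count claim above can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the default `alpha=0.9`, the `eps` stabiliser, the clipping of the ratio to $[0, 1]$, and the averaging of $b_2$ over output dimensions are all assumptions made for illustration. The second function assumes scalar target outputs and that each visit to $x$ regresses onto one uniformly sampled target, so the squared-loss minimiser $f_*(x)$ is the average of the $n$ sampled values.

```python
import numpy as np

def drnd_bonus(pred, targets, alpha=0.9, eps=1e-8):
    """Combine the two DRND bonuses for one input x.

    pred:    (d,)   predictor output f_theta(x)
    targets: (N, d) outputs of the N fixed random target networks at x
    """
    mu = targets.mean(axis=0)                 # first moment mu(x)
    second = (targets ** 2).mean(axis=0)      # second moment B_2(x)
    b1 = float(np.sum((pred - mu) ** 2))      # ||f_theta(x) - mu(x)||^2
    ratio = (pred ** 2 - mu ** 2) / (second - mu ** 2 + eps)
    # Assumption: clip to [0, 1] so the square root is defined for an imperfect predictor.
    b2 = float(np.sqrt(np.clip(ratio, 0.0, 1.0)).mean())
    return alpha * b1 + (1.0 - alpha) * b2, b1, b2


def mean_pseudo_count_statistic(n, targets_x, trials=200_000, seed=0):
    """Monte Carlo mean of y(x) after n visits to x (scalar target outputs)."""
    rng = np.random.default_rng(seed)
    mu = targets_x.mean()
    second = (targets_x ** 2).mean()
    # Each visit draws one target uniformly; f_*(x) is the mean of the n draws.
    idx = rng.integers(0, targets_x.shape[0], size=(trials, n))
    f_star = targets_x[idx].mean(axis=1)
    y = (f_star ** 2 - mu ** 2) / (second - mu ** 2)
    return float(y.mean())


# Hypothetical target outputs at a single state.
targets_x = np.linspace(-1.0, 1.0, 10) + 0.2
for n in (1, 4, 16):
    print(n, mean_pseudo_count_statistic(n, targets_x))  # should be close to 1/n
```

Under these assumptions the printed averages track $1/n$, which is the sense in which $y(x)$ behaves like an inverse visit count.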

Abstract

Exploration remains a critical challenge in deep reinforcement learning, where an agent must attain high returns in unknown environments. Although the widely used Random Network Distillation (RND) exploration algorithm has proven effective in numerous environments, it often lacks discriminative power in bonus allocation. This paper identifies the "bonus inconsistency" issue in RND as its primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances exploration by distilling a distribution of random networks and implicitly incorporating pseudo-counts to improve the precision of bonus allocation. This refinement encourages agents to explore more extensively. Our method mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and serves effectively as an anti-exploration mechanism in D4RL offline tasks. Our code is publicly available at https://github.com/yk7333/DRND.

Paper Structure

This paper contains 34 sections, 3 theorems, 22 equations, 14 figures, 18 tables, and 2 algorithms.

Key Result

Lemma 4.1

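Let $\tilde{\theta}$ and $\bar{\theta}_i, i = 1, 2, \ldots, N$ be i.i.d. samples from $p(\theta)$. Given the linear model $f_\theta(x) = \theta^T x$, the expected mean squared error is $\mathbb{E}_{\tilde{\theta}, \bar{\theta}_i}\left[\left\|f_{\tilde{\theta}}(x) - \frac{1}{N}\sum_{i=1}^{N} f_{\bar{\theta}_i}(x)\right\|^2\right] = \left(1 + \frac{1}{N}\right) x^T \Sigma x$, where $\Sigma$ is the variance of $p(\theta)$.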
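The lemma can be checked numerically. Assuming the error is measured between one sampled linear model $f_{\tilde{\theta}}(x)$ and the average of the $N$ models $f_{\bar{\theta}_i}(x)$, the difference $\tilde{\theta} - \frac{1}{N}\sum_i \bar{\theta}_i$ has zero mean and covariance $\Sigma + \Sigma/N$, which gives the closed form $(1 + 1/N)\, x^T \Sigma x$. The sketch below compares that value with a Monte Carlo estimate. The Gaussian choice for $p(\theta)$, the function names, and the example values are assumptions for illustration only.

```python
import numpy as np

def expected_mse_mc(x, mean, cov, n_targets, trials=100_000, seed=0):
    """Monte Carlo estimate of E[(f_tilde(x) - mean_i f_bar_i(x))^2] for f_theta(x) = theta^T x."""
    rng = np.random.default_rng(seed)
    # Assumption: p(theta) is Gaussian; the lemma only requires i.i.d. samples with covariance cov.
    theta_tilde = rng.multivariate_normal(mean, cov, size=trials)              # (trials, d)
    theta_bar = rng.multivariate_normal(mean, cov, size=(trials, n_targets))   # (trials, N, d)
    pred = theta_tilde @ x                         # f_tilde(x), shape (trials,)
    target_mean = (theta_bar @ x).mean(axis=1)     # average of the N target outputs
    return float(np.mean((pred - target_mean) ** 2))

def expected_mse_closed_form(x, cov, n_targets):
    """Closed form (1 + 1/N) x^T Sigma x."""
    return float((1.0 + 1.0 / n_targets) * (x @ cov @ x))

# Hypothetical example values.
x = np.array([1.0, -2.0, 0.5])
mean = np.array([0.3, 0.0, -1.0])
cov = np.array([[1.0, 0.2, 0.0],
                [0.2, 0.5, 0.0],
                [0.0, 0.0, 2.0]])

print(expected_mse_mc(x, mean, cov, n_targets=5))
print(expected_mse_closed_form(x, cov, n_targets=5))
```

The mean of $p(\theta)$ cancels in the difference, so only $\Sigma$ and $N$ affect the result. As $N$ grows, the error approaches $x^T \Sigma x$.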

Figures (14)

  • Figure 1: Bonus heatmap of the dataset distribution and the RND bonus. The left image shows the dataset distribution, the middle image shows the RND bonus before training, and the right image shows the RND bonus after training. A more detailed view of how the bonus changes is given in the appendix. Ideally, the bonus distribution should be uniform before any training and without exposure to the dataset. After extensive training, the expected bonus should correlate inversely with the dataset distribution. The RND bonus distribution does not match this desired distribution, which indicates bonus inconsistency. Details of the experiment settings can be found in the appendix.
  • Figure 2: Diagram of RND and DRND. Compared to the RND method that only distills a fixed target network, our method distills a randomly distributed target network and utilizes statistical metrics to assign a bonus to each state.
  • Figure 3: Distribution of the DRND bonus. The dataset distribution is the same as in Figure 1. These illustrations depict the distribution of the DRND bonus, including the first bonus and the second bonus. The first bonus is predominant before training, and the second bonus becomes more prominent after training.
  • Figure 4: Inconsistency experiments. We plot the intrinsic reward distribution of RND and DRND before and after training on a mini-dataset. Left: box plot of the difference between the maximum and minimum intrinsic rewards over 10 independent runs before training. Right: the intrinsic rewards for each data point after training.
  • Figure 5: Mean episodic return of DRND method, RND method, and baseline PPO method on three Atari games. All curves are averaged over 5 runs.
  • ...and 9 more figures

Theorems & Definitions (4)

  • Lemma 4.1
  • Lemma 4.2
  • Proof Sketch
  • Lemma 4.3