Exploration and Anti-Exploration with Distributional Random Network Distillation
Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li
TL;DR
This paper tackles bonus inconsistency in Random Network Distillation (RND) by introducing Distributional RND (DRND), which distills a distribution over multiple fixed random target networks to produce more discriminative intrinsic rewards. DRND defines two complementary bonuses, $b_1(x)=\|f_\theta(x)-\mu(x)\|^2$ and $b_2(x)=\sqrt{\frac{[f_\theta(x)]^2-\mu(x)^2}{B_2(x)-\mu(x)^2}}$, and combines them as $b(x)=\alpha b_1(x)+(1-\alpha) b_2(x)$, where $\mu(x)$ and $B_2(x)$ are the first and second moments of the target distribution. The predictor $f_\theta$ is trained with the loss $L(\theta)=\|f_\theta(x)-c(x)\|^2$, where $c(x)$ is a random variable that takes the output of one of the target networks, sampled at random. The authors show that the DRND predictor acts as a pseudo-count estimator: for the optimal predictor $f_*$ and a state $x$ that has occurred $n$ times, the statistic $y(x)=\frac{f_*^2(x)-\mu(x)^2}{B_2(x)-\mu(x)^2}$ is an unbiased estimator of $1/n$, with variance that vanishes as data accumulate. Empirically, DRND improves exploration in online Atari (Montezuma's Revenge, Gravitar, Venture), Adroit, and Fetch tasks, and improves offline performance on D4RL with SAC-DRND, outperforming several baselines while remaining computationally efficient. Overall, DRND provides a principled, distributional extension to RND that couples exploration with pseudo-count dynamics, yielding robust improvements across online and offline RL settings.
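The unbiasedness claim for $y(x)$ can be checked with a short calculation. What follows is a sketch under stated assumptions, not the authors' proof: outputs are treated as scalars, $x$ has occurred $n$ times, each occurrence draws its target $c_i(x)$ independently and uniformly from $N$ fixed target outputs $\bar f_1(x),\dots,\bar f_N(x)$ (the symbols $N$, $\bar f_j$ and $c_i$ are notation introduced here), and the predictor reaches the exact minimizer of the summed loss.

$$
\begin{aligned}
f_*(x) &= \arg\min_f \sum_{i=1}^{n}\bigl(f-c_i(x)\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n} c_i(x),\\
\mathbb{E}[c_i(x)] &= \mu(x), \qquad \mathbb{E}[c_i(x)^2] = B_2(x), \qquad \operatorname{Var}[c_i(x)] = B_2(x)-\mu(x)^2,\\
\mathbb{E}\bigl[f_*^2(x)\bigr] &= \bigl(\mathbb{E}[f_*(x)]\bigr)^2 + \operatorname{Var}[f_*(x)] = \mu(x)^2 + \frac{B_2(x)-\mu(x)^2}{n},\\
\mathbb{E}[y(x)] &= \frac{\mathbb{E}\bigl[f_*^2(x)\bigr]-\mu(x)^2}{B_2(x)-\mu(x)^2} = \frac{1}{n}.
\end{aligned}
$$

Under these assumptions, $b_2(x)$ evaluated at $f_\theta \approx f_*$ shrinks roughly like $1/\sqrt{n}$, which is the sense in which the second bonus behaves as a pseudo-count term.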
Abstract
Exploration remains a critical issue in deep reinforcement learning for an agent to attain high returns in unknown environments. Although Random Network Distillation (RND), a prevailing exploration algorithm, has proven effective in numerous environments, its bonus allocation often lacks discriminative power. This paper highlights the "bonus inconsistency" issue within RND, pinpointing its primary limitation. To address this issue, we introduce Distributional RND (DRND), a derivative of RND. DRND enhances the exploration process by distilling a distribution of random networks and implicitly incorporating pseudo-counts to improve the precision of bonus allocation. This refinement encourages agents to engage in more extensive exploration. Our method effectively mitigates the inconsistency issue without introducing significant computational overhead. Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks. Our code is publicly available at https://github.com/yk7333/DRND.
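To make the idea of distilling a distribution of random networks concrete, the following NumPy sketch computes the two bonuses $b_1$, $b_2$ and their combination $b$ for a single state. It is an illustration under several assumptions and not the code in the linked repository: the "target networks" are stand-in maps `tanh(W x)`, the trained predictor is replaced by its idealized optimum (the mean of `n` uniformly sampled target outputs), the per-dimension ratio is averaged over output dimensions and clipped at zero before the square root, and all function names and the value `alpha=0.9` are hypothetical choices made here.

```python
import numpy as np


def make_targets(num_targets, in_dim, out_dim, rng):
    """Weights of N fixed random targets (assumed stand-in: tanh(W x))."""
    return rng.normal(size=(num_targets, out_dim, in_dim))


def target_outputs(weights, x):
    """Outputs of all N targets at state x, shape (N, out_dim)."""
    return np.tanh(weights @ x)


def target_moments(outputs):
    """First moment mu(x) and second moment B_2(x) across the N targets."""
    mu = outputs.mean(axis=0)
    second_moment = (outputs ** 2).mean(axis=0)
    return mu, second_moment


def ideal_predictor(outputs, n, rng):
    """Idealized f_*(x): mean of n targets drawn uniformly with replacement."""
    idx = rng.integers(0, outputs.shape[0], size=n)
    return outputs[idx].mean(axis=0)


def pseudo_count_statistic(pred, mu, second_moment, eps=1e-8):
    """y(x) = (f^2 - mu^2) / (B_2 - mu^2), averaged over output dims."""
    ratio = (pred ** 2 - mu ** 2) / (second_moment - mu ** 2 + eps)
    return float(ratio.mean())


def drnd_bonus(pred, mu, second_moment, alpha=0.9):
    """Return (b, b1, b2) with b = alpha * b1 + (1 - alpha) * b2."""
    b1 = float(np.sum((pred - mu) ** 2))
    # y can be negative for a finite sample, so clip before the square root
    # (an assumption of this sketch).
    y = max(pseudo_count_statistic(pred, mu, second_moment), 0.0)
    b2 = float(np.sqrt(y))
    return alpha * b1 + (1.0 - alpha) * b2, b1, b2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = make_targets(num_targets=10, in_dim=4, out_dim=8, rng=rng)
    x = rng.normal(size=4)
    outputs = target_outputs(weights, x)
    mu, second_moment = target_moments(outputs)
    for n in (1, 10, 100):
        pred = ideal_predictor(outputs, n, rng)
        b, b1, b2 = drnd_bonus(pred, mu, second_moment)
        print(f"n={n:4d}  b1={b1:.4f}  b2={b2:.4f}  b={b:.4f}")
```

In this sketch the bonus for a state tends to shrink as its visit count `n` grows, which is what lets the same quantity act as an exploration bonus online and as an anti-exploration penalty offline. In practice the predictor would be a trained network rather than a sampled mean.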
