Random Latent Exploration for Deep Reinforcement Learning

Srinath Mahankali; Zhang-Wei Hong; Ayush Sekhari; Alexander Rakhlin; Pulkit Agrawal

TL;DR

This work tackles the exploration problem in deep reinforcement learning by introducing Random Latent Exploration (RLE), which conditions the policy on a latent vector $\boldsymbol{z}$ sampled at random from a fixed distribution $P_{\boldsymbol{z}}$. The agent is trained with a state-dependent randomized reward $F(s,\boldsymbol{z})=\phi(s)\cdot \boldsymbol{z}$, and both the policy $\pi(\cdot \mid s,\boldsymbol{z})$ and the value function $V^{\pi}(s,\boldsymbol{z})$ are conditioned on $\boldsymbol{z}$, which is resampled at the start of each trajectory. The method is a simple plug-in for PPO and yields deeper exploration on the Atari and Isaac Gym benchmarks, as evidenced by higher aggregated scores and more diverse trajectories; ablations indicate that it is robust to the choice of latent distribution and latent vector dimension. Although Montezuma’s Revenge remains challenging, the results indicate that random latent rewards can outperform traditional noise-based and some bonus-based strategies on a wide range of tasks, offering a scalable, general approach to exploration in deep RL.
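The mechanics summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map `phi` is a hypothetical fixed random linear layer followed by `tanh`, $P_{\boldsymbol{z}}$ is assumed to be uniform on the unit sphere, and the dimensions, the random-walk dynamics, and the placeholder policy are all invented for the example. In actual training, the random reward would be passed to PPO rather than simply recorded.

```python
import numpy as np

STATE_DIM, LATENT_DIM = 4, 8  # illustrative sizes, not taken from the paper

# Hypothetical stand-in for the feature network phi: fixed random weights + tanh.
_W_PHI = np.random.default_rng(0).normal(size=(LATENT_DIM, STATE_DIM))


def phi(state):
    """Map a state to a LATENT_DIM-dimensional feature vector in [-1, 1]."""
    return np.tanh(_W_PHI @ state)


def sample_latent(rng):
    """Draw z from a fixed distribution P_z (assumed: uniform on the unit sphere)."""
    z = rng.normal(size=LATENT_DIM)
    return z / np.linalg.norm(z)


def random_reward(state, z):
    """Randomized reward F(s, z) = phi(s) . z."""
    return float(phi(state) @ z)


def policy_input(state, z):
    """Both pi(. | s, z) and V(s, z) receive the state concatenated with z."""
    return np.concatenate([state, z])


def placeholder_policy(obs, rng):
    """Assumed dummy policy: ignores obs and returns a small random action."""
    del obs  # a trained network would map obs to an action distribution
    return 0.1 * rng.normal(size=STATE_DIM)


def rollout(num_steps, rng):
    """Collect one trajectory; z is sampled once and held fixed throughout."""
    z = sample_latent(rng)
    state = np.zeros(STATE_DIM)
    rewards = []
    for _ in range(num_steps):
        action = placeholder_policy(policy_input(state, z), rng)
        state = state + action  # toy dynamics, assumed for illustration
        rewards.append(random_reward(state, z))
    return z, np.array(rewards)
```

Calling `rollout` repeatedly draws a new `z` each time, so the same network is rewarded for moving in a different feature-space direction on every trajectory, which is the source of the trajectory diversity the summary describes.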

Abstract

We introduce Random Latent Exploration (RLE), a simple yet effective exploration strategy in reinforcement learning (RL). On average, RLE outperforms noise-based methods, which perturb the agent's actions, and bonus-based exploration, which rewards the agent for attempting novel behaviors. The core idea of RLE is to encourage the agent to explore different parts of the environment by pursuing randomly sampled goals in a latent space. RLE is as simple as noise-based methods, as it avoids complex bonus calculations but retains the deep exploration benefits of bonus-based methods. Our experiments show that RLE improves performance on average in both discrete (e.g., Atari) and continuous control tasks (e.g., Isaac Gym), enhancing exploration while remaining a simple and general plug-in for existing RL algorithms. Project website and code: https://srinathm1359.github.io/random-latent-exploration

Paper Structure (45 sections, 8 equations, 28 figures, 5 tables, 2 algorithms)

Introduction
Preliminaries
Reinforcement Learning (RL).
Our Method: Random Latent Exploration
Algorithmic Implementation
Experiments
Illustrative Experiments on FourRoom
Benchmarking Results on Atari
Evaluation on Isaac Gym
Ablation Studies
Choices of network architecture for the random reward network.
White noise random rewards.
Related Works
Discussion and Conclusion
$z$-sampling.
...and 30 more sections

Figures (28)

Figure 1: FourRoom environment. The agent starts at the top-right state (denoted by red 'S') and can move left, right, up, and down. The black bars denote walls that block the agent's movement.
Figure 2: Rollout of multiple trajectories from a policy trained with RLE in the middle of the training (1.5 million timesteps), where each color denotes a distinct trajectory. As the figure demonstrates, changing the latent vector $\boldsymbol{z}$ in RLE leads to diverse trajectories across all four rooms.
Figure 3: State visitation counts of all the methods after training for 2.5M timesteps without any task reward (reward-free exploration). The start location is represented by the red 'S' at the top right. RLE achieves much wider state visitation coverage over the course of training compared to other baselines, confirming that the diverse trajectories generated by the policy are useful for exploration.
Figure 4: Aggregated human-normalized score across all 57 Atari games. RLE exhibits a higher interquartile mean (IQM) of normalized score than PPO across the 57 Atari games, showing that RLE improves over PPO in the majority of tasks.
Figure 5: (a) Probability of improvement (POI) of our method, RLE, over the baselines NoisyNet, RND, and PPO across all $57$ Atari games (higher is better). The lower confidence bounds of RLE's POI over the other algorithms are all greater than $0.5$, which means that RLE's improvement over them is statistically significant (Agarwal et al., 2021). (b) Probability of improvement of RLE over the baselines RND, PPO, and PPO with reward normalization across all $9$ Isaac Gym tasks. In this domain as well, the lower confidence bounds of RLE's POI over the other algorithms are all greater than $0.5$, so the improvement is again statistically significant.
...and 23 more figures
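For readers unfamiliar with the two aggregate metrics named in the captions for Figures 4 and 5, the sketch below shows one common way to compute them, assuming scores are stored as arrays of shape `(num_runs, num_tasks)`. The function names are hypothetical, and the code is not taken from the paper; it computes point estimates only and omits the bootstrap procedure needed for the confidence bounds mentioned in the captions.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of all scores, pooled over runs and tasks."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    k = int(0.25 * flat.size)  # number of values trimmed from each end
    return float(flat[k:flat.size - k].mean())


def probability_of_improvement(scores_x, scores_y):
    """Estimate P(X > Y) per task from all run pairs, then average over tasks.

    Ties count as one half, as in the Mann-Whitney U statistic.
    """
    x = np.asarray(scores_x, dtype=float)  # shape (runs_x, num_tasks)
    y = np.asarray(scores_y, dtype=float)  # shape (runs_y, num_tasks)
    greater = (x[:, None, :] > y[None, :, :]).mean(axis=(0, 1))
    ties = (x[:, None, :] == y[None, :, :]).mean(axis=(0, 1))
    return float((greater + 0.5 * ties).mean())
```

A probability of improvement of $0.5$ means neither algorithm is more likely to win on a randomly chosen task, which is why the captions compare the lower confidence bound against that value.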
