Curiosity & Entropy Driven Unsupervised RL in Multiple Environments

Shaurya Dewan; Anisha Jain; Zoe LaLena; Lifan Yu

TL;DR

The paper addresses unsupervised reinforcement learning across multiple environments by extending a baseline entropy-driven pre-training framework with five modifications, notably curiosity-driven exploration and trajectory-entropy-based sampling. It introduces PDF-based trajectory sampling, a dynamic CVaR percentile $\alpha$, a higher KL divergence threshold, a forward dynamics curiosity term, and $\alpha$-percentile sampling over curiosity, and evaluates their effects in the Grid World and Ant environments. Results show that dynamic $\alpha$ and a higher KL threshold improve pre-training performance, while PDF sampling and $\alpha$-percentile sampling over curiosity yield limited or mixed gains; curiosity helps more in the high-dimensional Ant environment than in Grid World. The authors conclude that the integrated approach can boost performance in some settings and propose future work, including inverse dynamics modeling, more extensive trajectory sampling, and adaptive KL strategies, to further improve exploration efficiency and robustness.
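To make the "forward dynamics curiosity term" concrete, here is a minimal sketch of one common way such a bonus is computed: the squared prediction error of a learned forward model. The linear model, the function names `curiosity_bonus` and `update_forward_model`, and the learning rate are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def curiosity_bonus(W, state, action, next_state):
    """Squared prediction error of a linear forward model (hypothetical form)."""
    x = np.concatenate([state, action])
    predicted_next = W @ x
    return float(np.sum((next_state - predicted_next) ** 2))

def update_forward_model(W, state, action, next_state, lr=0.1):
    """One gradient step on 0.5 * ||W x - next_state||^2 with respect to W."""
    x = np.concatenate([state, action])
    error = W @ x - next_state
    return W - lr * np.outer(error, x)

# Revisiting the same transition: the bonus shrinks as the model learns it.
W = np.zeros((2, 3))
state, action = np.array([1.0, 0.0]), np.array([1.0])
next_state = np.array([0.0, 1.0])
for _ in range(3):
    print(curiosity_bonus(W, state, action, next_state))
    W = update_forward_model(W, state, action, next_state)
```

Under this assumed form, transitions the model already predicts well earn little bonus, which is consistent with the observation that curiosity adds less in small environments where little remains unknown to the agent.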

Abstract

The authors of 'Unsupervised Reinforcement Learning in Multiple Environments' propose a method, alpha-MEPOL, to tackle unsupervised RL across multiple environments. They pre-train a task-agnostic exploration policy using interactions from an entire environment class and then fine-tune this policy for various tasks with supervision. We build on this work with the goal of improving performance. We propose and experiment with five modifications to the original method: sampling trajectories using an entropy-based probability distribution, dynamic alpha, a higher KL divergence threshold, curiosity-driven exploration, and alpha-percentile sampling on curiosity. Dynamic alpha and a higher KL divergence threshold both provided a significant improvement over the baseline from the earlier work. PDF sampling failed to provide any improvement because it is approximately equivalent to the baseline method when the sample space is small. In high-dimensional environments, adding curiosity-driven exploration enhances learning by encouraging the agent to seek diverse experiences and explore the unknown. Its benefits are limited, however, in low-dimensional, simpler environments, where exploration possibilities are constrained and little is truly unknown to the agent. Overall, some of our experiments improved performance over the baseline, and a few directions seem promising for further research.
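The difference between percentile-based selection and entropy-based PDF sampling of trajectories can be sketched as follows. This is only an illustration under stated assumptions: the helper names `percentile_select` and `pdf_select`, the softmax over negated entropies, and the `temperature` parameter are hypothetical, and the exact distribution used by the authors is not given in this summary.

```python
import numpy as np

def percentile_select(entropies, alpha):
    """Keep the alpha fraction of trajectories with the lowest entropy."""
    entropies = np.asarray(entropies, dtype=float)
    k = max(1, int(np.ceil(alpha * len(entropies))))
    return np.argsort(entropies)[:k]

def pdf_select(entropies, alpha, rng, temperature=1.0):
    """Sample the same number of trajectories, favoring low entropy.

    Assumed distribution: softmax over negated entropies (hypothetical choice).
    """
    entropies = np.asarray(entropies, dtype=float)
    k = max(1, int(np.ceil(alpha * len(entropies))))
    logits = -entropies / temperature
    logits -= logits.max()  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(entropies), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
entropies = [3.0, 1.0, 2.0, 0.5]
print(percentile_select(entropies, alpha=0.5))
print(pdf_select(entropies, alpha=0.5, rng=rng))
```

The first function is deterministic, while the second occasionally admits higher-entropy trajectories, with the `temperature` value controlling how often.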

Paper Structure (17 sections, 5 equations, 7 figures)

Introduction
Previous Work
Improvements
Sampling Trajectories Using a PDF
Dynamic Alpha
Higher KL Divergence Threshold
Curiosity-Driven Exploration
$\alpha$-Percentile Sampling Over Curiosity
Results
Unsupervised Pre-Training Results
Grid World Environment Pre-Training
Ant Environment Pre-Training
Supervised Fine-Tuned Results
Grid World Environment Fine-tuning
Ant Environment Fine-tuning
...and 2 more sections

Figures (7)

Figure 1: Grid World Entropy Comparison Plots - In all plots, the pink line corresponds to the baseline model.
Figure 2: Grid World Exploration Heatmaps
Figure 3: Ant Entropy Comparison Plots - In all plots, the orange line corresponds to the baseline model.
Figure 4: Ant Exploration Heatmaps
Figure 5: Output frames of our model's fine-tuning process in Grid World
...and 2 more figures
