Some Considerations on Learning to Explore via Meta-Reinforcement Learning

Bradly C. Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, Ilya Sutskever

TL;DR

This work addresses how meta-reinforcement learning agents can optimize their own data sampling to improve fast adaptation across tasks. It introduces E-MAML and E-RL^2, algorithms that explicitly account for how initial task samples influence post-adaptation returns, and discusses flexible inner-update operators and practical training tricks. The Krazy World benchmark and maze experiments demonstrate that exploration-aware meta-learning can achieve faster initial gains and stronger final performance than standard MAML or RL^2, with results highlighting the importance of system identification and memory. The findings point to future work on curiosity signals and intrinsic rewards to further enhance long-horizon exploration in meta-learning.

Abstract

We consider the problem of exploration in meta reinforcement learning. Two new meta reinforcement learning algorithms are suggested: E-MAML and E-$\text{RL}^2$. Results are presented on a novel environment we call 'Krazy World' and a set of maze environments. We show E-MAML and E-$\text{RL}^2$ deliver better performance on tasks where exploration is important.

Paper Structure

This paper contains 21 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Three example worlds drawn from the task distribution. A good agent should first complete a successful system identification before exploiting. For example, in the leftmost grid the agent should identify the following: 1) the orange squares give +1 reward, 2) the blue squares replenish energy, 3) the gold squares block progress, 4) the black square can only be passed by picking up the pink key, 5) the brown squares will kill it, 6) it will slide over the purple squares. The center and right worlds show how these dynamics will change and need to be re-identified every time a new task is sampled.
  • Figure 2: One example maze environment rendered in human readable format. The agent attempts to find a goal within the maze.
  • Figure 3: Meta learning curves on Krazy World. We see that E-$\text{RL}^2$ achieves the best final results, but has the highest initial variance. Crucially, E-MAML converges faster than MAML, although both algorithms do manage to converge. $\text{RL}^2$ has relatively poor performance and high variance. A random agent achieves a score of around 0.05 on this task.
  • Figure 4: Meta learning curves on mazes. Figure \ref{fig:gaps} shows each curve in isolation, making it easier to discern their individual characteristics. E-MAML and E-$\text{RL}^2$ perform better than their counterparts.
  • Figure 5: Gap between initial performance and performance after one update. All algorithms show some level of improvement after one update. This suggests meta learning is working, because normal policy gradient methods learn nothing after one update.
  • ...and 2 more figures