First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs

Ben Norman; Jeff Clune

TL;DR

By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of human-like exploration on a broader range of domains.

Abstract

Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e., taking into account complex domain priors and adapting quickly based on previous exploration). Across episodes, RL agents struggle to perform even simple exploration strategies, such as systematic search that avoids exploring the same location multiple times. This poor exploration limits performance on challenging domains. Meta-RL is a potential solution: unlike standard RL, meta-RL can learn to explore, and can potentially learn highly complex strategies far beyond those of standard RL, such as experimenting in early episodes to learn new skills, or conducting experiments to learn about the current environment. Traditional meta-RL focuses on the problem of learning to optimally balance exploration and exploitation to maximize the cumulative reward of the episode sequence (e.g., aiming to maximize the total wins in a tournament while also improving as a player). We identify a new challenge with state-of-the-art cumulative-reward meta-RL methods. When optimal behavior requires exploration that sacrifices immediate reward to enable higher subsequent reward, existing state-of-the-art cumulative-reward meta-RL methods become stuck on the local optimum of failing to explore. Our method, First-Explore, overcomes this limitation by learning two policies: one to solely explore, and one to solely exploit. When exploring requires forgoing early-episode reward, First-Explore significantly outperforms existing cumulative meta-RL methods. By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of human-like exploration on a broader range of domains.

Paper Structure (30 sections, 5 equations, 20 figures, 9 tables)

Introduction
Background
Related Work
First-Explore
Experimental Setup
Training Method
Results
Limitations and Future Work
Discussion
Conclusion
Replicability
Training Pseudocode
Detailed Domains:
Bandits with One Fixed Arm
Dark Treasure-Rooms
...and 15 more sections

Figures (20)

Figure 1: First-Explore aims to maximize the cumulative reward of a sequence of $n$ episodes on a target environment distribution. This optimization is achieved by first training two separate policies, and then combining them after training to maximize the total reward obtained. A. First, two separate policies are trained on the distribution of environments: one to explore (produce informative episodes), and one to exploit (maximize current episode return). During training, the explore policy $\pi_{\text{explore}}$ provides all the context $c_i = \tau_1, \dots, \tau_i$ for both policies. This flow of context is visualized by solid arrows ($\rightarrow$). The exploit policy $\pi_{\text{exploit}}$ takes a context of episodes, and produces a single episode of exploitation. The return of this exploit episode is then used to train both policies, with the feedback to the explore policy visualized by the dotted green arrows. B. After the two policies are trained, different combinations of them are evaluated to find the combination that maximizes total reward. Each combination involves first exploring for $k$ episodes, and then repeatedly exploiting for the remaining $n-k$ episodes. C. The best combination is then used at inference time: exploring for a fixed number of episodes on new environments, and then exploiting for the remaining episodes.
Figure 2: Mean performance (averaged across sampled bandits) of algorithms for deceptive (left) and non-deceptive (right) versions of the bandit domain. Each method was trained 5 independent times, and each such run is plotted individually, so as to faithfully represent the variance between runs (e.g., that several of the bandit-domain RL$^2$ training runs achieve exactly the same reward). The Detailed Domains section provides alternative plots with mean reward $\pm$ standard deviation. The top figures plot the cumulative reward against the number of arm pulls, while the bottom figures illustrate the reward dynamics by plotting the individual pull rewards against the number of arm pulls. When the domain is deceptive, the cumulative-reward meta-RL method, RL$^2$ (fuchsia), performs extremely poorly, despite the deceptive domain giving strictly higher rewards than the non-deceptive version. In contrast, First-Explore (green) outperforms UCB (purple) and Thompson Sampling (orange), despite their being specialized bandit algorithms, in both the deceptive and non-deceptive settings, with $p < 10^{-5}$.
Figure 3: Mean performance (averaged across sampled treasure rooms) of algorithms for deceptive (left) and non-deceptive (right) versions of the Dark Treasure Room domain. Each method was trained 5 independent times, and each such run is plotted individually. The top figures plot the cumulative reward obtained against step and episode number, while the bottom figures provide a proxy for exploration by plotting the number of times agents move against step and episode number. When the domain is deceptive, the cumulative-reward meta-RL methods, RL$^2$ (fuchsia), HyperX (brown), and VariBAD (purple), achieve low total reward, as the policies learnt to minimize exploration. In contrast, First-Explore (green) performs well on both the deceptive and non-deceptive domains.
Figure 4: Left: Raw agent observations from a sampled ray maze, converted to an image. The agent receives the wall distances and the wall types. When this numerical data is portrayed as an image, goal locations are green, and the two wall orientations are distinguished (east-west teal, and north-south navy). To aid the eye, the floor has been coloured dark purple, and the sky yellow. Although the goal is visible, it could be a treasure (positive reward) or a trap (negative reward). Right: The image produced with direct ray casting (a large number of processed lidar measurements) rather than the 15 measurements the agent receives.
Figure 5: Mean performance averaged across 1000 Ray Mazes for five runs of each treatment. First-Explore strongly outperforms the SOTA meta-RL baselines on this complex environment, achieving a mean reward of 0.47, only slightly worse than the expected total reward of behaving optimally, 0.64.
...and 15 more figures
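
The selection step described in panels B and C of Figure 1 (explore for $k$ episodes, exploit for the remaining $n-k$, and keep the best $k$) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Policy`, `total_return`, `select_k`, `toy_explore`, and `toy_exploit` are hypothetical, episodes are reduced to an opaque record plus a scalar return, and the sketch assumes the exploit policy is conditioned only on the explore episodes. The toy policies stand in for trained networks purely so the example runs.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a context is the list of explore-episode records,
# and a policy maps a context to (episode record, episode return).
Context = List[str]
Policy = Callable[[Context], Tuple[str, float]]


def total_return(k: int, n: int, explore: Policy, exploit: Policy) -> float:
    """Explore for k episodes, then exploit for the remaining n - k."""
    context: Context = []
    total = 0.0
    for _ in range(k):
        episode, ret = explore(context)
        context.append(episode)  # explore episodes build up the context
        total += ret             # explore-episode reward still counts
    for _ in range(n - k):
        _, ret = exploit(context)  # assumption: context is explore episodes only
        total += ret
    return total


def select_k(n: int, explore: Policy, exploit: Policy,
             trials: int = 1) -> Tuple[int, float]:
    """Evaluate every k in 0..n and return (best k, its mean total return)."""
    scores: Dict[int, float] = {
        k: mean(total_return(k, n, explore, exploit) for _ in range(trials))
        for k in range(n + 1)
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy stand-ins (assumptions): exploring earns nothing, and exploiting
# improves with each explore episode until three have been collected.
def toy_explore(context: Context) -> Tuple[str, float]:
    return f"explore-{len(context)}", 0.0


def toy_exploit(context: Context) -> Tuple[str, float]:
    return "exploit", float(min(len(context), 3))


if __name__ == "__main__":
    print(select_k(10, toy_explore, toy_exploit))  # (3, 21.0)
```

In this toy setting the best choice is $k = 3$: each explore episode forgoes reward, so exploring beyond the point where the exploit policy stops improving only lowers the total. With stochastic policies or environments, `trials` would need to be larger than 1 for the mean to be a useful estimate.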
