Epistemic Monte Carlo Tree Search

Yaniv Oren; Villiam Vadocz; Matthijs T. J. Spaan; Wendelin Böhmer

TL;DR

Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

Abstract

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse-reward environments. MCTS, however, does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach that accounts for epistemic uncertainty in search and harnesses the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language subleq, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are practically unable to solve, much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.
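The idea of propagating epistemic uncertainty through search and acting optimistically on it can be illustrated with a small sketch. This is not the paper's algorithm: it assumes reward and value uncertainties are independent so that variances add along a path, and it uses a simple mean-plus-standard-deviation score for action selection. The function names, the `beta` exploration parameter, and the discount `gamma` are all hypothetical choices made for illustration.

```python
import math


def backup_mean_and_variance(path, leaf_value, leaf_variance, gamma=0.99):
    """Back up a value estimate and its epistemic variance from leaf to root.

    `path` is a list of (reward_mean, reward_variance) pairs ordered
    root -> leaf. Assumption (not from the paper): uncertainties are
    independent, so the variance of r + gamma * V is
    var(r) + gamma**2 * var(V).
    Returns one (value, variance) pair per path entry, ordered root -> leaf.
    """
    value, variance = leaf_value, leaf_variance
    backed_up = []
    for reward_mean, reward_variance in reversed(path):
        value = reward_mean + gamma * value
        variance = reward_variance + gamma ** 2 * variance
        backed_up.append((value, variance))
    backed_up.reverse()
    return backed_up


def optimistic_score(q_mean, q_variance, beta=1.0):
    """Upper-confidence-style score: mean plus beta standard deviations."""
    return q_mean + beta * math.sqrt(q_variance)


def select_action(children, beta=1.0):
    """Pick the action with the highest optimistic score.

    `children` maps each action to a (q_mean, q_variance) pair.
    """
    return max(children, key=lambda a: optimistic_score(*children[a], beta=beta))
```

With `beta = 0` the selection reduces to greedy choice on the mean; larger values of `beta` favor actions whose backed-up value is still uncertain, which is the behavior that drives exploration in sparse-reward tasks.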

Paper Structure (35 sections, 1 theorem, 26 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Background
Monte Carlo Tree Search
Quantifying Uncertainty in Deep Reinforcement Learning
Deep Exploration with Upper Confidence Bounds
Deep Exploration with Epistemic MCTS
Search with a Learned Reward Model
Planning for Exploration with Epistemic Search
Propagating Epistemic Uncertainty in Search
Search with a Learned Transition Model
Related Work
Experiments
subleq Experiments
Deep Sea Experiments
Conclusions
...and 20 more sections

Key Result

Theorem 1

For $\hat{M}$, $\mathcal{M}$, $Q^*$, $Q^\pi_{\hat{M}}$ defined as above and $\delta \in (0, 1]$:

Figures (5)

Figure 1: Sample efficiency. Left: the easy subleq Negate Positives task. Right: the harder Identity Function task. Mean of 15 seeds, two standard errors.
Figure 2: Left: Scaling to growing Deep Sea sizes, 5 seeds per point. Only 2 seeds of MZ+UBE were able to solve size 40, and none size 50 within the training budget, both marked with an X. Middle: Stochastic-reward Deep Sea 50x50, 10 seeds. Right: The effect of the exploration parameter in Deep Sea 30x30, 3 seeds per point. Mean and standard error.
Figure 3: Heat maps over states in Deep Sea 40x40 at different times (columns) during an example training run of EMCTS with an AZ transition model. Upper row: value uncertainty at the EMCTS root node. Middle row: single prediction of UBE at each state. Lower row: inverse visitation counts as reliable local uncertainty, where a score of 2.0 represents an unvisited state.
Figure 4: Deep Sea 40x40, mean and standard error for 20 seeds. Rows: Different transition models. Left: episodic return in evaluation vs. environment steps. Right: exploration rate (number of discovered states vs. environment steps).
Figure 5: Mean and two standard errors for 10 seeds.

Theorems & Definitions (1)

Theorem 1
