Epistemic Monte Carlo Tree Search

Yaniv Oren, Villiam Vadocz, Matthijs T. J. Spaan, Wendelin Böhmer

TL;DR

Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

Abstract

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse-reward environments. MCTS, however, does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language subleq, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are practically unable to solve, much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

Paper Structure (35 sections, 1 theorem, 26 equations, 5 figures, 4 tables, 1 algorithm)

Key Result

Theorem 1

For $\hat{M}$, $\mathcal{M}$, $Q^*$, $Q^\pi_{\hat{M}}$ defined as above and $\delta \in (0, 1]$:

Figures (5)

  • Figure 1: Sample efficiency. Left: the easy subleq Negate Positives task. Right: the harder Identity Function task. Mean of 15 seeds, two standard errors.
  • Figure 2: Left: Scaling to growing Deep Sea sizes, 5 seeds per point. Only 2 seeds of MZ+UBE solved size 40, and none solved size 50 within the training budget; both cases are marked with an X. Middle: Stochastic-reward Deep Sea 50x50, 10 seeds. Right: The effect of the exploration parameter in Deep Sea 30x30, 3 seeds per point. Mean and standard error.
  • Figure 3: Heat maps over states in Deep Sea 40x40 at different times (columns) during an example training run of EMCTS with an AZ transition model. Upper row: value uncertainty at the EMCTS root node. Middle row: single prediction of UBE at each state. Lower row: inverse visitation counts as reliable local uncertainty, where a score of 2.0 represents an unvisited state.
  • Figure 4: Deep Sea 40x40, mean and standard error for 20 seeds. Rows: Different transition models. Left: episodic return in evaluation vs. environment steps. Right: exploration rate (number of discovered states vs. environment steps).
  • Figure 5: Mean and two standard errors for 10 seeds.

Theorems & Definitions (1)

  • Theorem 1