Monte Carlo Tree Search with Boltzmann Exploration

Michael Painter, Mohamed Baioumy, Nick Hawes, Bruno Lacerda

TL;DR

This work addresses planning under uncertainty by analyzing Monte Carlo Tree Search (MCTS) with Boltzmann exploration. It shows that the maximum-entropy objective in MENTS can misalign with reward maximization and introduces two algorithms, BTS and DENTS, that preserve Boltzmann exploration while converging to the reward-maximizing policy; both can use the Alias method for efficient action sampling. Theoretical results establish simple-regret convergence for BTS and DENTS, while empirical results in gridworlds and the game of Go demonstrate robust performance and practical speedups. Overall, BTS and DENTS offer simple, effective alternatives to UCT and MENTS for planning with simulators.

Abstract

Monte-Carlo Tree Search (MCTS) methods, such as Upper Confidence Bound applied to Trees (UCT), are instrumental to automated planning techniques. However, UCT can be slow to explore an optimal action when it initially appears inferior to other actions. Maximum ENtropy Tree-Search (MENTS) incorporates the maximum entropy principle into an MCTS approach, utilising Boltzmann policies to sample actions, naturally encouraging more exploration. In this paper, we highlight a major limitation of MENTS: optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective. We introduce two algorithms, Boltzmann Tree Search (BTS) and Decaying ENtropy Tree-Search (DENTS), that address this limitation and preserve the benefits of Boltzmann policies, such as allowing actions to be sampled faster by using the Alias method. Our empirical analysis shows that our algorithms achieve consistently high performance across several benchmark domains, including the game of Go.
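
To make the notion of a Boltzmann policy concrete, here is a minimal Python sketch of sampling an action with probability proportional to exp(Q / temperature). The function names, the plain list of Q-value estimates, and the fixed temperature are illustrative assumptions; this is not the search policy of MENTS, BTS, or DENTS, each of which adds details not described on this page.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Return softmax(q / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the maximum keeps exp() from overflowing and
    # leaves the normalised probabilities unchanged.
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution (hypothetical helper)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A high temperature spreads probability across actions and so encourages exploration, while a temperature near zero concentrates almost all probability on the action with the highest estimate.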

Paper Structure

This paper contains 69 sections, 27 theorems, 139 equations, 34 figures, and 19 tables.

Key Result

Proposition 3.1

There exists an MDP $\mathcal{M}$ and temperature $\alpha$ such that $\mathbb{E}[\textnormal{reg}(s_0,\psi^n_{\textnormal{MENTS}})] \not\to 0$ as $n\to\infty$. That is, MENTS is not consistent.
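
A small numerical sketch can illustrate why a maximum-entropy objective may prefer a different action than the reward objective. It assumes the standard soft (log-sum-exp) value backup, $V_{\text{soft}}(s) = \alpha \log \sum_a \exp(Q_{\text{soft}}(s,a)/\alpha)$, and a hypothetical two-branch MDP invented for this illustration: `left` ends immediately with reward 1.0, while `right` leads to a state with `k` terminal actions worth 0.9 each. This is neither the paper's D-chain construction nor its proof.

```python
import math


def soft_value(q_values, alpha):
    """Soft value alpha * log(sum(exp(q / alpha))), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def preferred_root_action(alpha, k=2):
    """Return (soft-optimal action, reward-optimal action) for the toy MDP."""
    # 'left' is terminal, so its soft value equals its reward.
    q_left_soft = 1.0
    # 'right' reaches a state with k equally good actions; the entropy
    # bonus adds alpha * log(k) on top of the 0.9 reward.
    q_right_soft = soft_value([0.9] * k, alpha)
    # Under the original objective, only the best achievable reward counts.
    q_left, q_right = 1.0, 0.9
    soft_choice = "left" if q_left_soft >= q_right_soft else "right"
    reward_choice = "left" if q_left >= q_right else "right"
    return soft_choice, reward_choice
```

In this toy setting, `alpha = 1.0` gives `right` a soft value of 0.9 + log 2 ≈ 1.59, so the soft-optimal choice is `right` even though `left` earns more reward. With `alpha = 0.01` the bonus shrinks to about 0.007 and both objectives agree on `left`.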

Figures (34)

  • Figure 1: An illustration of the (modified) D-chain problem, where 1 is the starting state, transitions are deterministic and values next to states represent rewards for arriving in that state.
  • Figure 2: A comparison of MENTS, DENTS and UCT when run on the (modified) 10-chain.
  • Figure 3: Results for gridworld environments. Further results are given in Appendix \ref{app:hps}.
  • Figure 4: An example MDP to demonstrate the necessity for a decaying search temperature when using average returns.
  • Figure 5: An example of an alias table for a categorical distribution over four actions. To sample from the table, we draw a random index $I\sim U(\{1,2,3,4\})$ and a uniform random number $x\sim U(0,1)$, then follow the pointer depending on whether $x>\text{thresholds}[I]$. For example, sampling either $(I,x)=(2,0.5)$ or $(I,x)=(3,0.1)$ would lead to the action $a_3$.
  • ...and 29 more figures
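
The sampling procedure in the Figure 5 caption can be sketched in Python as follows. The table construction shown is a common one (often attributed to Vose) and is an assumption here, since this page does not say how the paper builds its tables; the function names are hypothetical, and indices are 0-based rather than the 1-based indices of the caption.

```python
import random


def build_alias_table(probs):
    """Build (thresholds, alias) for a categorical distribution in O(n)."""
    n = len(probs)
    scaled = [p * n for p in probs]   # an average slot has mass exactly 1
    thresholds = [1.0] * n            # probability of keeping slot i itself
    alias = list(range(n))            # fallback action stored in slot i
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        thresholds[s] = scaled[s]
        alias[s] = l
        # The large entry donates the mass needed to fill slot s.
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Any leftover entries keep threshold 1.0 and alias themselves.
    return thresholds, alias


def alias_sample(thresholds, alias, rng=random):
    """Draw one action index in O(1): pick a slot, then keep it or follow its alias."""
    i = rng.randrange(len(thresholds))
    x = rng.random()
    return alias[i] if x > thresholds[i] else i
```

Building the table takes time linear in the number of actions, after which each sample costs a constant number of operations. This is why the approach pays off when the same distribution is sampled many times before the table needs rebuilding.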

Theorems & Definitions (48)

  • Proposition 3.1
  • proof
  • Theorem 4.1
  • Theorem 4.2
  • Proposition B.1
  • Lemma E.1
  • Lemma E.2
  • Theorem E.3
  • proof
  • Lemma E.4
  • ...and 38 more