
Amplifying Exploration in Monte-Carlo Tree Search by Focusing on the Unknown

Cedric Derstroff, Jannis Brugger, Jannis Blüml, Mira Mezini, Stefan Kramer, Kristian Kersting

TL;DR

This work tackles the inefficiency of Monte-Carlo Tree Search (MCTS) in large trees where the algorithm revisits already explored regions. It introduces AmEx-MCTS, a decoupled formulation that separates value updates, visit counts, and the chosen path, and utilizes not-completely-explored-subtrees (nces) along with action selectors $a_{max}$ and $a_{select}$ to ignore fully explored regions while preserving MCTS principles; a variant AmÆx-MCTS further replaces the mean with a max in the UCT update. Theoretical analysis shows convergence to exhaustive search in the limit and preservation of UCT guarantees, while empirical results on three deterministic single-player domains (Chain, ChainLoop, and deterministic FrozenLake) demonstrate substantially broader search coverage and superior performance over classical MCTS and MCTS-T. These findings indicate significant efficiency gains for large-scale planning and single-player decision problems, suggesting practical impact for real-time and complex problem solving. The work also lays a foundation for future integration with neural components and end-to-end planning approaches in domains such as game endgames, chemistry, and materials design.
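The difference between the two action selectors can be illustrated with a minimal sketch. It assumes a standard UCT score and a hypothetical per-node `fully_explored` flag; the `Node` fields, the exploration constant `c`, and the function names are illustrative and not taken from the paper. Here $a_{max}$ is the action a classical UCT step would pick, while $a_{select}$ is the best-scoring action among children that still contain unexplored states.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    q: float = 0.0                 # value estimate of the state
    n: int = 0                     # visit count used in the UCT score (assumption)
    fully_explored: bool = False   # hypothetical flag: nothing unexplored below this node
    children: dict = field(default_factory=dict)  # action -> Node


def uct(parent: Node, child: Node, c: float = 1.4) -> float:
    """Standard UCT score; unvisited children are tried first."""
    if child.n == 0:
        return math.inf
    return child.q + c * math.sqrt(math.log(parent.n) / child.n)


def pick_actions(parent: Node, c: float = 1.4):
    """Return (a_max, a_select) for one selection step.

    a_max:    action with the highest UCT score among all children.
    a_select: action with the highest UCT score among children that are
              not fully explored, or None if every child is exhausted.
    """
    def score(action):
        return uct(parent, parent.children[action], c)

    a_max = max(parent.children, key=score)
    open_actions = [a for a, ch in parent.children.items() if not ch.fully_explored]
    a_select = max(open_actions, key=score) if open_actions else None
    return a_max, a_select


# Usage: the left subtree looks best but is exhausted, so the search descends right.
root = Node(n=10)
root.children["left"] = Node(q=0.9, n=6, fully_explored=True)
root.children["right"] = Node(q=0.2, n=4)
a_max, a_select = pick_actions(root)
```

In this toy tree `a_max` points at the exhausted left child and `a_select` at the right child, which is the situation the TL;DR describes as ignoring fully explored regions.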

Abstract

Monte-Carlo tree search (MCTS) is an effective anytime algorithm with a vast number of applications. It strategically allocates computational resources to focus on promising segments of the search tree, making it a very attractive search algorithm in large search spaces. However, it often expends its limited resources on reevaluating previously explored regions when they remain the most promising path. Our proposed methodology, denoted as AmEx-MCTS, solves this problem by introducing a novel MCTS formulation. Central to AmEx-MCTS is the decoupling of value updates, visit count updates, and the selected path during the tree search, thereby enabling the exclusion of already explored subtrees or leaves. This segregation preserves the utility of visit counts for both exploration-exploitation balancing and quality metrics within MCTS. The resulting augmentation facilitates a considerably broader search using identical computational resources while preserving the essential characteristics of MCTS. The expanded coverage not only yields more precise estimates but also proves instrumental in larger and more complex problems. Our empirical evaluation demonstrates the superior performance of AmEx-MCTS, which surpasses classical MCTS and related approaches by a substantial margin.

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 4 algorithms.

Figures (7)

  • Figure 1: Improving MCTS by ignoring already explored subtrees and leaves by focusing on the unknown. Updating the search strategy within MCTS by separating "incrementing visit counts" (displayed in blue) from the selected path (displayed in green) to explore more while keeping the number of iterations $n_\mathit{sims}$ the same.
  • Figure 2: Ignoring fully explored subtrees within the search. AmEx-MCTS introduces two new parameters, $a_\mathit{max}$ and $a_\mathit{select}$, to differentiate between fully explored subtrees and those that are not. Selecting $a_\mathit{select}$ ignores the already fully explored left subtree, while $a_\mathit{max}$ would only lead into already known states. $N_p$ denotes how often the node was visited within the search, $N_c$ how often the node had the highest UCT value among its siblings, and $Q$ the value of the state.
  • Figure 3: Updating the visit counts $\boldsymbol{N_c}$ along the original MCTS path to keep them representative for the output policy. Within the backpropagation step of MCTS, we separate the visit counts updated along the selected path (green) from the visit counts of the nodes that the original MCTS algorithm would have selected (blue). The Q-values are updated along the selected path.
  • Figure 4: Our approach dominates on the deterministic FrozenLake environment as proposed by Moerland et al. (2018). Higher values are better, where $1$ is the maximal return possible. The results reported are an average of 25 random seeds. Baseline results are taken from Moerland et al. (2018).
  • Figure 5: Our approach strongly outperforms the baselines on the Chain environment as used in Moerland et al. (2018). Higher values are better, where $1$ is the maximal return possible. The results reported are an average of 25 random seeds. Baseline results are taken from Moerland et al. (2018).
  • ...and 2 more figures
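
The decoupled update described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Stats` container, the `backpropagate` function, the running-mean update, and the `use_max` switch (a loose reading of the AmÆx variant) are all hypothetical. The two paths are passed in explicitly because how the UCT-preferred path is traced is not specified in this summary.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    q: float = 0.0   # value estimate Q of the state
    n_p: int = 0     # how often the node lay on the selected path
    n_c: int = 0     # how often the node was the UCT-preferred choice


def backpropagate(selected_path, uct_path, ret, use_max=False):
    """Decoupled backup for one simulation (illustrative sketch).

    selected_path: nodes actually traversed (green in Figure 3).
    uct_path:      nodes classical MCTS would have traversed (blue in Figure 3).
    ret:           return obtained at the end of the selected path.
    """
    # Values follow the path that was really taken.
    for node in selected_path:
        node.n_p += 1
        if use_max:
            # Assumed max-style backup: keep the best return seen so far.
            node.q = ret if node.n_p == 1 else max(node.q, ret)
        else:
            # Assumed incremental mean over the returns seen on this node.
            node.q += (ret - node.q) / node.n_p
    # Visit counts for the output policy follow the UCT-preferred path.
    for node in uct_path:
        node.n_c += 1


# Usage: UCT prefers the exhausted left child, but the search descends right.
root, left, right = Stats(), Stats(), Stats()
backpropagate(selected_path=[root, right], uct_path=[root, left], ret=1.0)
backpropagate(selected_path=[root, right], uct_path=[root, left], ret=0.0)
```

After these two simulations the left child has accumulated the policy-relevant count `n_c` without receiving any value update, while the right child carries the value estimates, which mirrors the separation of blue and green paths in Figure 3.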