Extreme Value Monte Carlo Tree Search for Classical Planning
Masataro Asai, Stephen Wissow
TL;DR
The paper tackles the challenge of applying Monte Carlo Tree Search to domain-independent classical planning by aligning bandit assumptions with the actual properties of cost-to-go heuristics. It leverages Peaks-Over-Threshold Extreme Value Theory to model the extrema of heuristic samples, deriving a new Uniform-bandit approach (UCB1-Uniform) and a corresponding MCTS variant (GreedyUCT-Uniform). The authors provide a formal regret bound for UCB1-Uniform and demonstrate experimentally that it outperforms baselines such as GBFS, Softmin-Type(h), GUCT-Normal2, and several Max-$k$ bandits across multiple benchmarks and heuristics. The results pair empirical gains with theoretical justification, suggesting EVT-based bandits as a principled direction for planning with uncertain, unbounded cost-to-go estimates. This work advances the integration of statistical modeling into heuristic search, with practical impact on the scalability and robustness of domain-independent planners.
Abstract
Despite being successful in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) has seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2024) showed that UCB1, designed for bounded rewards, does not perform well when applied to cost-to-go estimates in classical planning, which are unbounded in $\mathbb{R}$, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues. First, Gaussian MABs under-specify the support of cost-to-go estimates as $(-\infty,\infty)$, which we can narrow down. Second, Full Bellman backup (Schulte and Keller 2014), which backpropagates the sample max/min, lacks theoretical justification. We use \emph{Peaks-Over-Threshold Extreme Value Theory} to resolve both issues at once, and propose a new bandit algorithm (UCB1-Uniform). We formally prove its regret bound and empirically demonstrate its performance in classical planning.
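For readers unfamiliar with the baseline, the following is a minimal sketch of the textbook UCB1 rule that the abstract refers to, not the UCB1-Uniform algorithm proposed in the paper. It assumes reward maximization with rewards in $[0,1]$; the function names (`ucb1_index`, `run_ucb1`) and the toy constant-reward arms are hypothetical and chosen only for illustration. The fixed exploration term is where the bounded-reward assumption enters, which is why the rule may be poorly calibrated for unbounded cost-to-go estimates.

```python
import math


def ucb1_index(mean, pulls, total_pulls):
    """Textbook UCB1 index; the exploration bonus assumes rewards in [0, 1]."""
    return mean + math.sqrt(2.0 * math.log(total_pulls) / pulls)


def run_ucb1(reward_fns, rounds):
    """Run UCB1 for `rounds` steps over arms given as zero-argument callables."""
    k = len(reward_fns)
    counts = [0] * k      # number of pulls per arm
    means = [0.0] * k     # running mean reward per arm
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1   # initialization: pull every arm once
        else:
            arm = max(range(k), key=lambda i: ucb1_index(means[i], counts[i], t))
        reward = reward_fns[arm]()
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


# Toy usage (hypothetical arms with constant rewards inside [0, 1]).
counts, means = run_ucb1([lambda: 0.2, lambda: 0.8], rounds=500)
```

In this toy run the arm with the higher reward ends up with most of the pulls. If rewards were on an arbitrary scale (say, heuristic values in the hundreds), the same bonus term would be negligible relative to the means, so exploration would effectively vanish; this is one way to read the motivation stated in the abstract.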
