Extreme Value Monte Carlo Tree Search for Classical Planning

Masataro Asai, Stephen Wissow

TL;DR

The paper tackles the challenge of applying Monte Carlo Tree Search (MCTS) to domain-independent classical planning by aligning bandit assumptions with the actual properties of cost-to-go heuristics. It leverages Peaks-Over-Threshold Extreme Value Theory (EVT) to model the extrema of heuristic samples, deriving a new uniform-distribution bandit (UCB1-Uniform) and a corresponding MCTS variant (GreedyUCT-Uniform). The authors provide a formal regret bound for UCB1-Uniform and report experiments in which it outperforms baselines such as GBFS, Softmin-Type(h), GUCT-Normal2, and several Max-$k$ bandits across multiple benchmarks and heuristics. The results pair empirical gains with a theoretical justification, suggesting EVT-based bandits as a principled direction for planning with uncertain, unbounded cost-to-go estimates. The work brings extreme-value statistical modeling into heuristic search, with practical impact on the scalability and robustness of domain-independent planners.

Abstract

Despite being successful in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) has seen limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai 2024) showed that UCB1, designed for bounded rewards, does not perform well when applied to cost-to-go estimates in classical planning, which are unbounded in $\mathbb{R}$, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as $(-\infty,\infty)$, which we can narrow down. Second, Full Bellman backup (Schulte and Keller 2014), which backpropagates sample max/min, lacks theoretical justification. We use Peaks-Over-Threshold Extreme Value Theory to resolve both issues at once, and propose a new bandit algorithm (UCB1-Uniform). We formally prove its regret bound and empirically demonstrate its performance in classical planning.

Paper Structure

This paper contains 33 sections, 13 theorems, 28 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Given i.i.d. $x_1,\ldots,x_N\sim {\mathcal{N}}({\textnormal{x}}|\mu,\sigma)$ (i.e., ${\textnormal{x}}\sim{\mathcal{N}}(\mu,\sigma)$), the MLEs of $\mu$ and $\sigma$ are the empirical mean $\hat{\mu}=\frac{1}{N}\sum_i x_i$ and variance $\hat{\sigma}^2=\frac{1}{N-1}\sum_i (x_i-\hat{\mu})^2$. (Well-known
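
As a minimal sketch of the estimators named in Theorem 1, the following Python computes the empirical mean and the variance with the $\frac{1}{N-1}$ normalization used in the statement. The function name `gaussian_estimates` and the sample values are illustrative assumptions, not taken from the paper.

```python
def gaussian_estimates(xs):
    """Return (mu_hat, var_hat) for a list of real-valued samples.

    mu_hat  : empirical mean, (1/N) * sum(x_i)
    var_hat : empirical variance, (1/(N-1)) * sum((x_i - mu_hat)^2)
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples to estimate the variance")
    mu_hat = sum(xs) / n
    # Deviations are squared before summing.
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / (n - 1)
    return mu_hat, var_hat


# Hypothetical heuristic samples, chosen only to make the arithmetic easy.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = gaussian_estimates(samples)
print(mu_hat, var_hat)
```

For these eight samples the mean is 5 and the squared deviations sum to 32, so the variance estimate is 32/7.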

Figures (6)

  • Figure 1: Generalized Pareto distribution $\mathrm{GP}(0,1,\xi)$.
  • Figure 2: Computing the average and the variance is seen as fitting ${\mathcal{N}}(\mu,\sigma)$; computing the maximum and the shape of the tail distribution is seen as fitting $\mathrm{GP}(\mu,\sigma,\xi)$ with $\xi<0$.
  • Figure 3: Given equally informative plateaus, UCB1-Uniform focuses on one plateau to find an exit quickly.
  • Figure A4: The cumulative histogram of the number of problem instances solved ($y$-axis) below a certain number of node evaluations ($x$-axis, 10,000 nodes maximum). Each line represents a random seed. The total numbers at the limit differ from those in other plots (this result does not limit the expansions or the runtime).
  • Figure A5: The cumulative histogram of the number of problem instances solved ($y$-axis) below a certain number of node expansions ($x$-axis, 4,000 nodes maximum). Each line represents a random seed. The total numbers at the limit differ from those in other plots (this result does not limit the evaluations or the runtime).
  • ...and 1 more figure
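
The captions of Figures 1 and 2 refer to $\mathrm{GP}(\mu,\sigma,\xi)$. The sketch below assumes the standard textbook parameterization of the generalized Pareto distribution (location $\mu$, scale $\sigma>0$, shape $\xi$); the paper's own definition may differ in notation, and the helper name `gp_cdf` is hypothetical. It evaluates the cumulative distribution function and illustrates two properties of the $\xi<0$ case mentioned in Figure 2: the support has a finite upper end point $\mu-\sigma/\xi$, and for $\xi=-1$ the distribution coincides with the uniform distribution on $[\mu,\mu+\sigma]$.

```python
import math


def gp_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """CDF of the generalized Pareto distribution GP(mu, sigma, xi).

    Assumes the standard parameterization:
      F(x) = 1 - (1 + xi * z) ** (-1 / xi)   if xi != 0
      F(x) = 1 - exp(-z)                     if xi == 0
    where z = (x - mu) / sigma and z >= 0.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if z <= 0:
        return 0.0  # below the location parameter
    if xi == 0.0:
        return 1.0 - math.exp(-z)
    base = 1.0 + xi * z
    if base <= 0.0:
        # Only reachable when xi < 0: x is at or beyond mu - sigma / xi.
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


# xi = -1: the CDF is linear on [0, 1], i.e. uniform.
print([gp_cdf(x, xi=-1.0) for x in (0.25, 0.5, 0.75)])
# xi = -0.5: the support ends at 0 - 1 / (-0.5) = 2.
print(gp_cdf(3.0, xi=-0.5))
```

The $\xi=-1$ special case is what makes a uniform distribution a natural model for bounded extremes, which is consistent with the name UCB1-Uniform; the exact derivation is the paper's and is not reproduced here.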

Theorems & Definitions (30)

  • Definition 1
  • Definition 2: NECs
  • Definition 3: Full Bellman Backup
  • Definition 4: Monte Carlo Backup
  • Theorem 1
  • Definition 5
  • Definition 6
  • Theorem 2: CLT
  • Definition 7: Generalized Pareto Distribution
  • Theorem 3: Pickands--Balkema--de Haan theorem
  • ...and 20 more