Scale-Adaptive Balancing of Exploration and Exploitation in Classical Planning

Stephen Wissow; Masataro Asai

TL;DR

This work addresses agile classical planning by reframing exploration-exploitation balance as a MAB problem with nonstandard reward scales. It identifies fundamental issues with applying UCB1 in planning and introduces a variance-aware Gaussian bandit, UCB1-Normal2, paired with GreedyUCT-Normal2 to adapt exploration to reward dispersion. Empirical results across IPC2018 satisficing domains show substantial improvements over GBFS and prior MCTS-based planners, with more plans found using fewer node expansions and competitive runtime performance. The study provides theoretical and empirical justification for variance-aware exploration in discrete planning and outlines how to integrate complementary enhancements such as preferred operators and deferred evaluation for further gains.

Abstract

Balancing exploration and exploitation has been an important problem in both game tree search and automated planning. However, while the problem has been extensively analyzed within the Multi-Armed Bandit (MAB) literature, the planning community has had limited success when attempting to apply those results. We show that a more detailed theoretical understanding of the MAB literature helps improve existing planning algorithms that are based on Monte Carlo Tree Search (MCTS) / Trial-Based Heuristic Tree Search (THTS). In particular, THTS uses UCB1 MAB algorithms in an ad hoc manner, as UCB1's theoretical requirement of reward distributions with fixed, bounded support is not satisfied within heuristic search for classical planning. The core issue lies in UCB1's lack of adaptation to the different scales of the rewards. We propose GreedyUCT-Normal, an MCTS/THTS algorithm with the UCB1-Normal bandit for agile classical planning, which handles distributions with different scales by taking the reward variance into consideration. This results in improved algorithmic performance (more plans found with fewer node expansions), outperforming Greedy Best First Search and existing MCTS/THTS-based algorithms (GreedyUCT, GreedyUCT*).
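The scale issue described in the abstract can be illustrated with a small sketch. It compares the exploration bonus of UCB1 with that of UCB1-Normal, using the index formulas from Auer et al. (2002); the function names and the sample rewards are hypothetical, and this is not the paper's planner code. It only computes the bonus terms, not a full bandit policy.

```python
import math


def ucb1_bonus(n_i, t):
    """Exploration bonus of UCB1 for an arm pulled n_i times out of t pulls.

    The bonus depends only on counts, so it implicitly assumes rewards
    on a fixed scale such as [0, 1].
    """
    return math.sqrt(2.0 * math.log(t) / n_i)


def ucb1_normal_bonus(rewards, t):
    """Exploration bonus of UCB1-Normal (Auer et al., 2002) for one arm.

    `rewards` holds the samples observed for this arm (at least two),
    and t is the total number of pulls so far (t >= 2).
    """
    n_i = len(rewards)
    mean = sum(rewards) / n_i
    sum_sq = sum(r * r for r in rewards)
    # Unbiased sample variance, clamped to guard against rounding error.
    variance = max(0.0, (sum_sq - n_i * mean * mean) / (n_i - 1))
    return math.sqrt(16.0 * variance * math.log(t - 1) / n_i)


# Hypothetical reward samples, and the same samples on a 100x larger scale.
samples = [3.0, 5.0, 4.0, 6.0]
scaled = [100.0 * r for r in samples]
```

Rescaling the rewards by 100 leaves `ucb1_bonus` unchanged, so the bonus becomes negligible next to the mean, whereas `ucb1_normal_bonus` grows by the same factor of 100 because it is proportional to the sample standard deviation.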

Paper Structure (29 sections, 9 theorems, 29 equations, 14 figures, 5 tables, 1 algorithm)

Background
Multi-Armed Bandit (MAB)
Forward Heuristic Best-First Search
Base MCTS for Graph Search
Existing MCTS-based Classical Planning
Bandit for Unbounded Distributions
Experimental Evaluation
Detailed Ablation
GUCT-Normal2
Simple Regret
Preferred Operators
Deferred Evaluation
Diversified Search
Solution Quality
Runtime Comparison
...and 14 more sections

Key Result

Theorem 1

UCB1-Normal has a logarithmic regret-per-arm $256\frac{\sigma_i^2 \log T}{\Delta_i^2}+1+\frac{\pi^2}{2}+8\log T$ if, for a Student's $t$ RV $X$ with $s$ degrees of freedom (DOF), $\forall a \in [0, \sqrt{2(s+1)}]: P(X\geq a)\leq e^{-a^2/4}$, and if, for a $\chi^2$ RV $X$ with $s$ DOF, $P(X\geq 4s)\le \dots$
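To get a feel for how the per-arm bound above behaves, the expression can be evaluated numerically. The sketch below assumes the usual reading of the symbols, which this excerpt does not define: $\sigma_i^2$ as the reward variance of arm $i$, $\Delta_i$ as its gap to the best arm, and $T$ as the number of pulls. The function name is hypothetical, and the bound itself holds only under the conditions the theorem states.

```python
import math


def ucb1_normal_regret_bound(sigma2, gap, horizon):
    """Evaluate 256 * sigma2 * log(T) / gap^2 + 1 + pi^2 / 2 + 8 * log(T).

    sigma2:  assumed reward variance of the arm (sigma_i^2)
    gap:     assumed suboptimality gap of the arm (Delta_i), must be > 0
    horizon: assumed number of pulls T, must be >= 1
    """
    if gap <= 0 or horizon < 1:
        raise ValueError("gap must be positive and horizon at least 1")
    log_t = math.log(horizon)
    return (256.0 * sigma2 * log_t / gap ** 2
            + 1.0 + math.pi ** 2 / 2.0 + 8.0 * log_t)
```

For a fixed gap and horizon, the bound grows linearly in the variance, which is the sense in which the regret adapts to the scale of the rewards.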

Figures (14)

Figure 1: Search behavior. Larger $\sigma$ assigns more probability on the lower $h$ values we get in the next expansion.
Figure 2: Comparing solution length of GUCT[-01/Normal2] and Softmin-Type(h) ($x$-axis) against GBFS ($y$-axis) using $h^{\mathrm{FF}}$.
Figure 3: The cumulative histogram of the number of problem instances solved ($y$-axis) below a certain number of node evaluations ($x$-axis, 10,000 nodes maximum). Each line represents a random seed. In algorithms with an exploration coefficient hyperparameter, we use $c=1.0$. The total numbers at the limit differ from those in other plots (this result does not limit the expansions or the runtime).
Figure 4: The cumulative histogram of the number of problem instances solved ($y$-axis) below a certain number of node expansions ($x$-axis, 4,000 nodes maximum). Each line represents a random seed. In algorithms with an exploration coefficient hyperparameter, we use $c=1.0$. The total numbers at the limit differ from those in other plots (this result does not limit the evaluations or the runtime).
Figure 5: The cumulative histogram of the number of problem instances solved ($y$-axis) below a certain runtime ($x$-axis, 300 seconds maximum). Each line represents a random seed. In algorithms with an exploration coefficient hyperparameter, we use $c=1.0$. The total numbers at the limit differ from those in other plots (this result does not limit the evaluations or the expansions). GBFS shows a unique slowdown, potentially due to the suboptimal heap-based open list implementation in Pyperplan, which is not necessarily representative of GBFS performance in general (e.g., in Fast Downward, it can be implemented as a bucket-based open list). To test this hypothesis, we also implemented GBFS with a bucket-based queue in Pyperplan; its results are shown as GBFSbucket. GBFSbucket and GBFS have similar runtime curves, indicating that the unique curve of GBFS is not due to the efficiency of open list insertion.
...and 9 more figures
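The Figure 5 caption contrasts heap-based and bucket-based open lists. A minimal sketch of the two data structures follows, assuming non-negative integer heuristic values and FIFO tie-breaking; the class names are hypothetical, and this is not the Pyperplan or Fast Downward implementation.

```python
import heapq
from collections import defaultdict, deque
from itertools import count


class HeapOpenList:
    """Open list backed by a binary heap: O(log n) push and pop."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # insertion counter gives FIFO tie-breaking

    def push(self, h, node):
        heapq.heappush(self._heap, (h, next(self._tie), node))

    def pop(self):
        h, _, node = heapq.heappop(self._heap)
        return h, node


class BucketOpenList:
    """Open list with one FIFO queue per integer h value: O(1) push."""

    def __init__(self):
        self._buckets = defaultdict(deque)
        self._size = 0
        self._min_h = None  # lower bound on the smallest non-empty bucket

    def push(self, h, node):
        self._buckets[h].append(node)
        self._size += 1
        if self._min_h is None or h < self._min_h:
            self._min_h = h

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from an empty open list")
        # Skip empty buckets until the smallest non-empty one is reached.
        while not self._buckets.get(self._min_h):
            self._min_h += 1
        node = self._buckets[self._min_h].popleft()
        self._size -= 1
        h = self._min_h
        if self._size == 0:
            self._min_h = None
        return h, node
```

Both classes return nodes in the same order (lowest `h` first, oldest first among ties), so they are interchangeable in a best-first search loop and differ only in insertion and removal cost.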

Theorems & Definitions (17)

Definition 1
Theorem 1: From Auer et al. (2002)
Theorem 2: Main Result
proof
Definition 2
Theorem 3
proof
Definition 3
Corollary 1
Theorem 4
...and 7 more
