A Bayesian Approach to Online Planning

Nir Greshler; David Ben Eli; Carmel Rabinovitz; Gabi Guetta; Liran Gispan; Guy Zohar; Aviv Tamar

TL;DR

The paper tackles online planning under neural network uncertainty by formulating tree search in Bayesian terms: a prior is placed over complete trees, and the posterior is updated from leaf observations. It introduces two practical algorithms, Thompson sampling tree search (TSTS) and Bayes-UCB tree search, and proves a finite-time Bayesian regret bound for the Thompson sampling rule that grows with the square root of the prior entropy $\mathcal{H}(z^*)$, so a more certain prior gives a tighter guarantee. The authors integrate neural network uncertainty estimation (MLE and ensembles) within a self-play, AlphaZero-style loop to learn posterior value distributions and target planning outcomes. Empirically, uncertainty-aware planning yields substantial gains when the uncertainty estimates are accurate (especially with ground-truth uncertainty). However, the experiments also suggest that current learned uncertainty estimates may be insufficient on some ProcGen tasks, which highlights the need for better uncertainty estimation in planning pipelines.
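
To make the Bayes-UCB idea concrete, the following is a minimal sketch of quantile-based selection, not the paper's Bayes-UCB tree search. It assumes independent Gaussian posteriors over candidate values and a quantile level of $1 - 1/(t+1)$; the function names `gaussian_quantile` and `bayes_ucb_select` and this quantile schedule are illustrative assumptions.

```python
from statistics import NormalDist


def gaussian_quantile(mean, std, q):
    """q-quantile of a Gaussian posterior N(mean, std^2), for 0 < q < 1."""
    # Use the standard normal quantile so that std == 0 (a known value) is allowed.
    return mean + std * NormalDist().inv_cdf(q)


def bayes_ucb_select(means, stds, t):
    """Pick the candidate with the largest upper posterior quantile.

    Assumption (not from the paper): the quantile level is 1 - 1/(t + 1),
    a simplified schedule that becomes more optimistic as the step t >= 1 grows.
    """
    q = 1.0 - 1.0 / (t + 1)
    scores = [gaussian_quantile(m, s, q) for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    means = [0.5, 0.4]  # hypothetical posterior means
    stds = [0.0, 0.5]   # candidate 0 is known exactly, candidate 1 is uncertain
    print(bayes_ucb_select(means, stds, t=1))   # median-based: favors the higher mean
    print(bayes_ucb_select(means, stds, t=39))  # 97.5% quantile: favors the uncertain one
```

At the median the rule is greedy in the posterior mean. At higher quantile levels, an uncertain candidate with a lower mean can be preferred, which is what drives the exploration.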

Abstract

The combination of Monte Carlo tree search and neural networks has revolutionized online planning. As neural network approximations are often imperfect, we ask whether uncertainty estimates about the network outputs could be used to improve planning. We develop a Bayesian planning approach that facilitates such uncertainty quantification, inspired by classical ideas from the meta-reasoning literature. We propose a Thompson-sampling-based algorithm for searching the tree of possible actions, for which we prove the first (to our knowledge) finite-time Bayesian regret bound, and propose an efficient implementation for a restricted family of posterior distributions. In addition, we propose a variant of the Bayes-UCB method applied to trees. Empirically, we demonstrate that on the ProcGen Maze and Leaper environments, when the uncertainty estimates are accurate but the neural network output is inaccurate, our Bayesian approach searches the tree much more effectively. We also investigate whether popular uncertainty estimation methods are accurate enough to yield significant gains in planning. Our code is available at: https://github.com/nirgreshler/bayesian-online-planning.
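
As a rough illustration of Thompson sampling over candidate leaves, here is a minimal sketch under strong simplifying assumptions. It is not the TSTS algorithm from the paper: the leaves form a flat set with independent Gaussian posteriors, and expanding a leaf is assumed to reveal its value exactly. The names `thompson_select` and `run_search` are hypothetical.

```python
import numpy as np


def thompson_select(means, stds, rng):
    """Draw one value per leaf from its Gaussian posterior; return the argmax leaf."""
    samples = rng.normal(means, stds)
    return int(np.argmax(samples))


def run_search(prior_means, prior_stds, true_values, budget, seed=0):
    """Run `budget` rounds of Thompson sampling leaf selection.

    Assumption (illustrative only): observing a leaf reveals its true value,
    so that leaf's posterior collapses to a point mass (std = 0).
    """
    rng = np.random.default_rng(seed)
    means = np.array(prior_means, dtype=float)
    stds = np.array(prior_stds, dtype=float)
    for _ in range(budget):
        leaf = thompson_select(means, stds, rng)
        means[leaf] = true_values[leaf]
        stds[leaf] = 0.0
    # Commit to the leaf with the best posterior mean after the search.
    return int(np.argmax(means)), means, stds


if __name__ == "__main__":
    best, means, stds = run_search(
        prior_means=[0.5, 0.4, 0.1],  # hypothetical network value estimates
        prior_stds=[0.3, 0.3, 0.3],   # hypothetical uncertainty estimates
        true_values=[0.2, 0.9, 0.1],
        budget=10,
    )
    print(best, means, stds)
```

The sketch shows the mechanism the abstract alludes to: a leaf whose network estimate is inaccurate but whose uncertainty is honestly large still has a chance of being sampled as the best and expanded, whereas a purely greedy search would keep trusting the point estimate.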

Paper Structure (29 sections, 7 theorems, 35 equations, 19 figures, 3 algorithms)

Introduction
Bayesian Online Planning
Bayesian Tree Search
Practical Thompson Sampling Tree Search
Improved Exploration via Bayes-UCB
Action Commitment in Online Planning
Learning in Bayesian Tree Search
Related Work
Experiments
Results
Discussion
Bayes-UCB Tree Search Pseudo-code
Implementation Details
ProcGen Maze Environment
Neural Network Training Parameters
...and 14 more sections

Key Result

Theorem 1

The regret of the leaf selection rule defined in Eq. \ref{eq:TS_action_prob} satisfies: $\mathbb{E}\left[\textrm{Regret}(T)\right] \leq H R_{\max}\sqrt{\frac{1}{2}|\mathcal{Z}|\mathcal{H}(z^*)T}.$
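
To get a feel for how the bound in Theorem 1 behaves, the right-hand side can be evaluated numerically. The sketch below is a plain transcription of that expression. It assumes that $|\mathcal{Z}|$ is the size of the set $\mathcal{Z}$ of candidate state-action pairs and that the entropy $\mathcal{H}(z^*)$ is measured in nats. The function names, the parameter values, and the choice of units are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats, an assumption) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regret_bound(horizon, r_max, num_z, prior_entropy, steps):
    """Right-hand side of Theorem 1: H * R_max * sqrt(0.5 * |Z| * H(z*) * T)."""
    return horizon * r_max * math.sqrt(0.5 * num_z * prior_entropy * steps)


if __name__ == "__main__":
    H, R_MAX, NUM_Z, T = 3, 1.0, 8, 100  # hypothetical values

    uniform = [1.0 / NUM_Z] * NUM_Z          # uninformative prior over z*
    peaked = [0.93] + [0.01] * (NUM_Z - 1)   # confident prior over z*

    print(regret_bound(H, R_MAX, NUM_Z, entropy(uniform), T))
    print(regret_bound(H, R_MAX, NUM_Z, entropy(peaked), T))
```

Two properties follow directly from the formula: the bound grows as $\sqrt{T}$, so quadrupling the search budget doubles it, and it vanishes when the prior over $z^*$ is a point mass, since the entropy is then zero.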

Figures (19)

Figure 1: Example of value estimation errors during search
Figure 2: Illustration of the formulation in Section \ref{sec:formulation}. A tree $\mathcal{T}$ of depth $H=3$ is shown. Let the actions $\{L,R\}$ correspond to the left and right transitions, respectively. Assume that the optimal branch is $(S_0,R)\to(S_1,L)\to(S_2,R)$. Then, $z^* = (S_2,R)$. At time $t=4$, the state-action pairs that have already been explored are marked with solid lines, and the next state-action pair to be explored is $z_t = (S_1,R)$. The set $\mathcal{Z}_t$ is marked in purple. Note that $z^*$ is indicative of the optimal branch, and also of $z^*_t$, and of the optimal action at the root, $A^*$.
Figure 3: TSTS Algorithm Schematic. Plots (a) and (b) show two successive iterations of forward sampling, where states in $S_{\textrm{known}}$ are marked in gray. Subsequently, in Plot (c), state $s_2$ is added to $S_{\textrm{known}}$, and the max-backup routine is performed to update the posteriors.
Figure 4: Success rate of different planners on ProcGen Maze. Left + Middle: deterministic planners. Right: stochastic planners. Error bars are over 6 neural networks obtained from independent training runs. See Section \ref{ssec:experiments_maze} for more details.
Figure 5: Ground truth uncertainty error and action commitment ablations, per Section \ref{ssec:experiments_maze} in the text.
...and 14 more figures

Theorems & Definitions (14)

Theorem 1
Example 1
Proposition 1
Proposition 2
proof
proof
Proposition 3
proof
Proposition 4
proof
...and 4 more
