Monte Carlo Tree Search in the Presence of Transition Uncertainty

Farnaz Kohankhaki, Kiarash Aghakasiri, Hongming Zhang, Ting-Han Wei, Chao Gao, Martin Müller

TL;DR

This work addresses decision-making with an imperfect environment model by introducing Uncertainty Adapted MCTS (UA-MCTS), which learns a transition-uncertainty function and uses it to steer search away from unreliable transitions. It makes all four MCTS phases (selection, expansion, simulation, backpropagation) uncertainty-aware and proves a completeness property; in the corrupted bandit setting, the uncertainty-adapted UCB rule (UA-UCB) is proven to have a tighter regret bound than standard UCB. Empirically, UA-MCTS substantially improves performance on the deterministic MinAtar games, especially when the uncertainty function is learned online, and it often approaches true-model planning despite model errors. A key finding is that learning a compact uncertainty model can outperform attempting to learn full transition corrections, which matters for planning under model misspecification.

Abstract

Monte Carlo Tree Search (MCTS) is an immensely popular search-based framework used for decision making. It is traditionally applied to domains where a perfect simulation model of the environment is available. We study and improve MCTS in the context where the environment model is given but imperfect. We show that the discrepancy between the model and the actual environment can lead to significant performance degradation with standard MCTS. We therefore develop Uncertainty Adapted MCTS (UA-MCTS), a more robust algorithm within the MCTS framework. We estimate the transition uncertainty in the given model, and direct the search towards more certain transitions in the state space. We modify all four MCTS phases to improve the search behavior by considering these estimates. We prove, in the corrupted bandit case, that adding uncertainty information to adapt UCB leads to a tighter regret bound than standard UCB. Empirically, we evaluate UA-MCTS and its individual components on the deterministic domains from the MinAtar test suite. Our results demonstrate that UA-MCTS strongly improves MCTS in the presence of model transition errors.
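
The idea of adapting UCB with uncertainty information can be illustrated with a small bandit-style sketch. The exact UA-UCB rule is not stated in this summary, so the form below is an assumption: the usual exploration bonus is scaled by $(1 - \delta_i)$, where $\delta_i \in [0, 1]$ is the estimated transition uncertainty of arm $i$. The function names `ua_ucb_scores` and `select_arm` are hypothetical.

```python
import math


def ua_ucb_scores(means, counts, deltas, c=math.sqrt(2)):
    """Score each arm with an uncertainty-scaled UCB bonus.

    ASSUMPTION: bonus_i = c * (1 - delta_i) * sqrt(ln(N) / n_i).
    This is one plausible form, not necessarily the paper's rule.
    With every delta_i = 0 it reduces to standard UCB1-style scoring.
    """
    total = sum(counts)
    scores = []
    for mean, n, delta in zip(means, counts, deltas):
        if n == 0:
            # Unvisited arms are tried first, as in standard UCB.
            scores.append(math.inf)
            continue
        bonus = c * (1.0 - delta) * math.sqrt(math.log(total) / n)
        scores.append(mean + bonus)
    return scores


def select_arm(means, counts, deltas, c=math.sqrt(2)):
    """Return the index of the highest-scoring arm (first index on ties)."""
    scores = ua_ucb_scores(means, counts, deltas, c)
    return max(range(len(scores)), key=scores.__getitem__)
```

Under this assumed form, two arms with equal empirical means and visit counts are separated only by their uncertainty: the arm with the smaller $\delta_i$ receives the larger bonus and is selected, which matches the stated goal of directing search towards more certain transitions.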

Paper Structure

This paper contains 20 sections, 3 theorems, 6 equations, 5 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Here $C$ is a constant, $\beta_i = c^2 (1-\delta_i)^2$, $\Delta_i = \mu_{i^*} - \mu_{i}$ and $\hat{\Delta}_i = \hat{\mu}_{i^*} - \hat{\mu}_i$, assuming $0 \leq \delta_i \leq 1 - \sqrt{\frac{1}{2c^2}}.$
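
As a reading aid (not part of the theorem statement, which is not reproduced here), the assumption on $\delta_i$ can be rewritten as a bound on $\beta_i$. Assuming $c > 0$:

$$
\delta_i \leq 1 - \sqrt{\frac{1}{2c^2}}
\;\Longrightarrow\;
1 - \delta_i \geq \frac{1}{\sqrt{2}\,c}
\;\Longrightarrow\;
\beta_i = c^2 (1-\delta_i)^2 \geq \frac{1}{2},
$$

and $\delta_i \geq 0$ gives $\beta_i \leq c^2$. The admissible interval for $\delta_i$ is non-empty only when $c \geq 1/\sqrt{2}$. For example, with $c = \sqrt{2}$ the assumption reads $0 \leq \delta_i \leq 1/2$.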

Figures (5)

  • Figure 1: Offline (vertical bars) and online scenarios for Space Invaders. Bars and shaded areas show mean $\pm$ std of rewards over 100 runs for the offline scenario and 15 runs for the online scenario. For the corrupted and true models (the two leftmost cases), the best $c$ is $2$ and $\sqrt{2}$, respectively. The best $c$ values chosen for the UA-MCTS algorithms (the remaining five cases) are [$2$, $2$, $1$, $0.5$, $\sqrt{2}$] from left to right. For the online scenario, we plot the moving average of the reward with a window size of 50.
  • Figure 2: Offline (vertical bars) and online scenarios for Freeway. Bars and shaded areas show mean $\pm$ std of rewards over 100 runs for the offline scenario and 15 runs for the online scenario. For the corrupted and true models (the two leftmost cases), the best $c$ is $0.5$ and $2$, respectively. The best $c$ values chosen for the UA-MCTS algorithms (the remaining five cases) are [$2$, $2$, $\sqrt{2}$, $2$, $\sqrt{2}$] from left to right. For the online scenario, we plot the moving average of the reward with a window size of 50.
  • Figure 3: Offline (vertical bars) and online scenarios for Breakout. Bars and shaded areas show mean $\pm$ std of rewards over 100 runs for the offline scenario and 15 runs for the online scenario. For the corrupted and true models (the two leftmost cases), the best $c$ is $2$ and $\sqrt{2}$, respectively. The best $c$ values chosen for the UA-MCTS algorithms (the remaining five cases) are [$1$, $2$, $\sqrt{2}$, $2$, $\sqrt{2}$] from left to right. For the online scenario, we plot the moving average of the reward with a window size of 50.
  • Figure 4: 2-Way GridWorld Environment. White cells are empty and grey cells are walls.
  • Figure 5: Comparison between learning the transition function with different hidden layer (HL) sizes and the UA-MCTS method. The number of steps to the goal is averaged over 30 runs. The exploration parameter $c$ for "True Model" and "Corrupted Model" is $2$ and $\sqrt{2}$, respectively. For the other scenarios, $c$ is $\sqrt{2}$.
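
The online curves in Figures 1 to 3 are smoothed with a moving average of window size 50. A minimal sketch of such smoothing is given below. The captions do not say how the first points (before a full window is available) are handled, so averaging over the points seen so far is an assumption, and `moving_average` is a hypothetical helper name.

```python
def moving_average(values, window=50):
    """Trailing moving average of a reward sequence.

    ASSUMPTION: for the first window - 1 points, the average is taken
    over the points available so far instead of being dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the trailing window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

For instance, with `window=2` the sequence `[1, 2, 3, 4]` is smoothed to `[1.0, 1.5, 2.5, 3.5]`.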

Theorems & Definitions (4)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • proof