Polynomial Regret Concentration of UCB for Non-Deterministic State Transitions

Can Cömer, Jannis Blüml, Cedric Derstroff, Kristian Kersting

TL;DR

This paper derives polynomial regret concentration bounds for the Upper Confidence Bound (UCB) algorithm in multi-armed bandit problems with stochastic transitions. The bounds offer improved theoretical guarantees and broaden the applicability of MCTS to real-world decision-making problems with probabilistic outcomes, such as those in autonomous systems and financial decision-making.

Abstract

Monte Carlo Tree Search (MCTS) has proven effective in solving decision-making problems in perfect information settings. However, its application to stochastic and imperfect information domains remains limited. This paper extends the theoretical framework of MCTS to stochastic domains by addressing non-deterministic state transitions, where actions lead to probabilistic outcomes. Specifically, building on the work of Shah et al. (2020), we derive polynomial regret concentration bounds for the Upper Confidence Bound algorithm in multi-armed bandit problems with stochastic transitions, offering improved theoretical guarantees. Our primary contribution is proving that these bounds also apply to non-deterministic environments, ensuring robust performance in stochastic settings. This broadens the applicability of MCTS to real-world decision-making problems with probabilistic outcomes, such as in autonomous systems and financial decision-making.

Paper Structure

This paper contains 93 sections, 5 theorems, 19 equations, 5 figures.

Key Result

Theorem 1

For a non-deterministic, non-stationary MAB satisfying properties 1 (Convergence) and 2 (Concentration), the value $\overline{X}_n=\frac{1}{n} \sum_{i=1}^K \sum_{j=1}^{K_i} T^i_j(T_i(n)) \overline{X}^i_{j, T^i_j(T_i(n))},$ obt

Figures (5)

  • Figure 1: Running MCTS on $4 \times 4$ FrozenLake: The higher $n$, the better the performance. Is this by accident? No, we establish a polynomial regret concentration for running MCTS on stochastic and imperfect information domains.
  • Figure 2: Illustration of non-deterministic, non-stationary Multi-Armed Bandit problem. The upper confidence bound algorithm selects an action within the first layer, and a random transition is applied within the second.
  • Figure 3: As predicted by the main theorem, in FrozenLake, the regret decreases and shows greater concentration as $n$ grows.
  • Figure 4: A non-deterministic transition layer where the agent transitions to state $j \in [\widetilde{K}]$ with probability $\widetilde{p}_j$ to receive reward $Y_{j,\widetilde{T}_j(s)}$.
  • Figure 5: A screenshot of the FrozenLake environment showing the $4 \times 4$ map with the player at the start state, goal (present) and 4 holes.
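The two-layer setting described in Figures 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`polynomial_bonus`, `run_two_layer_ucb`, `decomposed_mean`), the bonus form `c * t**a / s**b` with its constants, the clipped Gaussian reward noise, and the toy transition probabilities are all hypothetical. The sketch selects an arm with a polynomial UCB bonus, samples a random outcome for that arm, and then checks that the overall empirical mean equals the count-weighted sum of per-outcome means, which is the shape of the quantity in Theorem 1.

```python
import random


def polynomial_bonus(t, s, c=1.0, a=0.25, b=0.5):
    # Assumed illustrative bonus: grows polynomially in the round t and
    # shrinks polynomially in the pull count s. Constants are hypothetical.
    return c * t ** a / s ** b


def run_two_layer_ucb(transition_probs, outcome_means, n, noise=0.1, seed=0):
    """Layer 1: UCB picks arm i. Layer 2: outcome j is drawn with
    probability transition_probs[i][j] and yields a noisy reward."""
    rng = random.Random(seed)
    k = len(transition_probs)
    pulls = [0] * k                      # number of pulls of each arm
    arm_sum = [0.0] * k                  # total reward collected per arm
    counts = [[0] * len(p) for p in transition_probs]   # visits per (arm, outcome)
    sums = [[0.0] * len(p) for p in transition_probs]   # reward per (arm, outcome)
    total = 0.0
    for t in range(1, n + 1):
        if t <= k:
            i = t - 1                    # pull every arm once to initialise
        else:
            i = max(
                range(k),
                key=lambda arm: arm_sum[arm] / pulls[arm]
                + polynomial_bonus(t, pulls[arm]),
            )
        # Non-deterministic transition: the learner does not choose j.
        j = rng.choices(range(len(transition_probs[i])),
                        weights=transition_probs[i])[0]
        # Noisy reward clipped to [0, 1] (assumed reward model).
        reward = min(1.0, max(0.0, rng.gauss(outcome_means[i][j], noise)))
        pulls[i] += 1
        arm_sum[i] += reward
        counts[i][j] += 1
        sums[i][j] += reward
        total += reward
    return pulls, counts, sums, total / n


def decomposed_mean(counts, sums, n):
    # (1/n) * sum over arms i and outcomes j of count_ij * mean_ij.
    acc = 0.0
    for arm_counts, arm_sums in zip(counts, sums):
        for cnt, s in zip(arm_counts, arm_sums):
            if cnt > 0:
                acc += cnt * (s / cnt)
    return acc / n


probs = [[0.5, 0.5], [0.8, 0.2]]         # toy transition probabilities per arm
means = [[0.2, 0.4], [0.9, 0.1]]         # toy mean reward per (arm, outcome)
pulls, counts, sums, x_bar = run_two_layer_ucb(probs, means, n=2000)
print(pulls, x_bar, decomposed_mean(counts, sums, 2000))
```

In this toy instance the second arm has the higher expected reward (0.74 versus 0.30), so the pull counts should concentrate on it as `n` grows, while the two ways of computing the empirical mean agree up to floating-point error.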

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • proof
  • Theorem 3
  • Lemma 1
  • proof
  • proof
  • proof : Proof of the main theorem
  • Lemma 2
  • proof