
Super-Exponential Regret for UCT, AlphaGo and Variants

Laurent Orseau, Remi Munos

TL;DR

This work improves the proofs of lower bounds on the regret of UCT and its variants on the $D$-chain environment, showing super-exponential growth in the depth $D$. By correcting an oversight for rewards bounded in $[0,1]$ and extending the proofs to AlphaGo/AlphaZero-style MCTS, the authors show that standard UCT, Polynomial UCT, and AlphaZero-family search all incur enormous regret on this deceptive environment. For standard UCT the bound is a tower of $\Omega(D)$ nested exponentials, $\exp(\dots\exp(1)\dots)$; for Polynomial UCT and AlphaGo-style MCTS it is doubly exponential, $\exp_2(\exp_2(D - O(\log D)))$. These bounds underscore fundamental limits of these exploration strategies in such environments. The results have implications for the practical use of MCTS variants in game-playing AI, suggesting the need for improved exploration mechanisms or structural assumptions to bound regret in complex trees.

Abstract

We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a 'polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment. The original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero) to show $\exp_2(\exp_2(D - O(\log D)))$ regret.
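To get a feel for the magnitudes in the abstract, the following is a minimal numerical sketch, not taken from the paper. It assumes that $\exp_2(x)$ means $2^x$ and replaces the unspecified $O(\log D)$ term with $c \log_2 D$ for a hypothetical constant `c`; the function names are illustrative only. Because the doubly exponential bound overflows floating point almost immediately, the sketch reports its base-2 logarithm instead.

```python
import math


def log2_double_exp_bound(depth: int, c: float = 1.0) -> float:
    """Return log2 of exp_2(exp_2(D - c * log2(D))), assuming exp_2(x) = 2**x.

    The constant c stands in for the unspecified O(log D) term (hypothetical).
    """
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    return 2.0 ** (depth - c * math.log2(depth))


def exp_tower(height: int) -> float:
    """Evaluate exp(...exp(1)...) with `height` nested exp terms.

    Returns math.inf once the value no longer fits in a float.
    """
    value = 1.0
    for _ in range(height):
        try:
            value = math.exp(value)
        except OverflowError:
            return math.inf
    return value


if __name__ == "__main__":
    for depth in (4, 8, 16, 32):
        print(f"D={depth:2d}: log2(bound) = {log2_double_exp_bound(depth):.6g}")
    for height in range(1, 6):
        print(f"tower of height {height}: {exp_tower(height):.6g}")
```

Under these assumptions, with `c = 1` and $D = 8$ the logarithm of the bound is $2^{8-3} = 32$, so the bound itself is $2^{32}$. The tower of exponentials already exceeds double-precision range at height 4, which illustrates why a tower with $\Omega(D)$ terms dwarfs the doubly exponential bound.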

Paper Structure (5 sections, 20 equations, 1 figure)

Figures (1)

  • Figure 1: The $D$-chain environment. Edge labels are actions, and node labels are rewards.

Theorems & Definitions (2)

  • Remark 1
  • Remark 2