Super-Exponential Regret for UCT, AlphaGo and Variants
Laurent Orseau, Remi Munos
TL;DR
This work tightens the regret lower bounds of Coquelin and Munos (2007) for UCT and its variants on the $D$-chain environment, and fixes an oversight in the original proofs for rewards bounded in $[0,1]$. Standard UCT can incur regret that is a tower of $\Omega(D)$ nested exponentials, $\exp(\dots\exp(1)\dots)$, while Polynomial UCT incurs $\exp_2(\exp_2(D - O(\log D)))$ regret. Adapting the proofs to the MCTS used by AlphaGo and its descendants (e.g., AlphaZero, Leela Zero) yields the same $\exp_2(\exp_2(D - O(\log D)))$ bound. These results expose a fundamental limit of these exploration strategies on deceptive environments such as the $D$-chain, and suggest that MCTS variants used in game-playing AI need improved exploration mechanisms or structural assumptions to bound regret in deep trees.
Abstract
We improve the proofs of the lower bounds of Coquelin and Munos (2007), which demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a `polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment. The original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero), showing that they too incur $\exp_2(\exp_2(D - O(\log D)))$ regret.
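To give a sense of scale for the two kinds of bound quoted above, the following minimal Python sketch evaluates small instances of an iterated exponential and of a double exponential. It assumes that $\exp_2(x)$ denotes $2^x$, and it uses base 2 throughout so that values stay exact integers, whereas the UCT bound is stated with $\exp$. The argument `k` is a hypothetical stand-in for $D - O(\log D)$, whose hidden constants are not given here. The function names are illustrative and are not taken from the paper.

```python
def exp2_tower(height: int) -> int:
    """Iterate x -> 2**x `height` times, starting from 1 (base 2 is an assumption)."""
    value = 1
    for _ in range(height):
        value = 2 ** value  # exact: Python integers have arbitrary precision
    return value


def double_exp2(k: int) -> int:
    """Return 2**(2**k); `k` is a hypothetical stand-in for D - O(log D)."""
    return 2 ** (2 ** k)


if __name__ == "__main__":
    # Towers of height 0..4 are still small enough to print in full.
    for height in range(5):
        print(f"tower of height {height}: {exp2_tower(height)}")

    # One more level is already too large to print usefully,
    # so only its size in bits is reported.
    print(f"tower of height 5 has {exp2_tower(5).bit_length()} bits")

    # The double exponential grows far more slowly than the tower.
    for k in range(1, 6):
        print(f"double exponential at k={k}: {double_exp2(k)}")
```

Each extra level of the tower exponentiates the previous value, which is why a bound with $\Omega(D)$ nested exponentials grows far faster in $D$ than the double-exponential bound.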
