Super-Exponential Regret for UCT, AlphaGo and Variants

Laurent Orseau; Remi Munos

TL;DR

This work improves the proofs of regret lower bounds for UCT and its variants on the $D$-chain environment, showing super-exponential growth in the depth $D$. By correcting an oversight in the original proofs for rewards bounded in $[0,1]$ and extending the arguments to AlphaGo/AlphaZero-style MCTS, the authors show that standard UCT, Polynomial UCT, and AlphaZero-family search all incur immense regret on this deceptive environment. The bounds are nested exponentials in $D$: a tower of $\Omega(D)$ exponentials for UCT, and $\exp_2(\exp_2(D - O(\log D)))$ for Polynomial UCT and AlphaZero-style search. These results expose a fundamental limit of these exploration strategies in such environments and suggest that MCTS variants used in game-playing AI need improved exploration mechanisms or structural assumptions to bound regret.

Abstract

We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a "polynomial" UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment. The original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero) and show that they, too, have $\exp_2(\exp_2(D - O(\log D)))$ regret.
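To give a feel for the magnitudes involved, the following minimal sketch evaluates nested exponentials with exact integer arithmetic. It assumes that $\exp_2(x)$ means $2^x$, which the abstract does not spell out. The helper names `exp2`, `tower`, and `double_exp` are hypothetical, and the `slack` parameter is only a stand-in for the unspecified $O(\log D)$ term. The tower uses base 2 for convenience, whereas the UCT bound in the abstract is written with $\exp$.

```python
def exp2(x: int) -> int:
    """Assumed meaning of exp_2: two raised to the power x."""
    return 2 ** x


def tower(height: int, top: int = 1) -> int:
    """Apply exp2 `height` times starting from `top` (a base-2 power tower)."""
    value = top
    for _ in range(height):
        value = exp2(value)
    return value


def double_exp(depth: int, slack: int = 0) -> int:
    """exp2(exp2(depth - slack)); `slack` is a hypothetical stand-in for O(log D)."""
    return exp2(exp2(depth - slack))


if __name__ == "__main__":
    for height in range(5):
        print(f"tower({height}) = {tower(height)}")
    # Height 5 is too large to print comfortably; report its size in bits.
    print(f"tower(5) needs {tower(5).bit_length()} bits")
    print(f"double_exp(5) = {double_exp(5)}")
```

Even at height 5 the tower no longer fits in a machine word by a wide margin, which is why a bound whose height grows with $D$ is so much stronger than a fixed double exponential.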

Paper Structure (5 sections, 20 equations, 1 figure)

Introduction
The $D$-chain environment
Polynomial UCT lower bound
AlphaZero lower bound
UCT lower bound

Figures (1)

Figure 1: The $D$-chain environment. Edge labels are actions, and node labels are rewards.
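
As a rough illustration of an environment of this shape, the sketch below models a chain in which each step either continues deeper or exits to a rewarded leaf. The reward values are an assumption and are not given in the caption: exiting at depth $d$ is taken to pay $(D-d)/D$, and completing the chain is taken to pay 1, so that shallow exits look attractive while the best reward sits at the bottom. The action encoding (1 to continue, 2 to exit) and the function name `d_chain_return` are likewise hypothetical.

```python
from fractions import Fraction
from typing import List

CONTINUE, EXIT = 1, 2  # hypothetical action labels


def d_chain_return(D: int, actions: List[int]) -> Fraction:
    """Reward collected by following `actions` from the root of a D-chain.

    Assumed reward schedule (not stated in the caption):
    exiting at depth d pays (D - d) / D; taking CONTINUE D times pays 1.
    """
    for d, action in enumerate(actions, start=1):
        if action == EXIT:
            return Fraction(D - d, D)
        if action != CONTINUE:
            raise ValueError(f"unknown action {action!r}")
        if d == D:
            return Fraction(1)  # reached the end of the chain
    raise ValueError("action sequence ends before reaching a leaf")


if __name__ == "__main__":
    D = 5
    print(d_chain_return(D, [EXIT]))            # exit immediately
    print(d_chain_return(D, [CONTINUE] * D))    # follow the whole chain
```

Under this assumed schedule, every early exit pays less than the one before it, so a search procedure that trusts early averages has little reason to keep descending toward the optimal leaf.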

Theorems & Definitions (2)

Remark 1
Remark 2
