
Bayesian Optimization in AlphaGo

Yutian Chen, Aja Huang, Ziyu Wang, Ioannis Antonoglou, Julian Schrittwieser, David Silver, Nando de Freitas

TL;DR

Bayesian optimization was applied to tune AlphaGo's game-playing hyper-parameters across design iterations, yielding progressive strength gains and contributing to the match performance against Lee Sedol. The methods use Gaussian-process priors and Expected Improvement via Spearmint, modeling self-play win-rate with 50 games per evaluation to handle non-differentiable, expensive evaluations. Applied to five tuning tasks, the approach delivered substantial Elo gains and offered insights into parameter interactions and component contributions (e.g., fast roll-outs vs. value networks), with gains compounding over iterations. This work demonstrates a practical, data-efficient strategy for optimizing complex, non-differentiable hyper-parameter spaces in large reinforcement learning systems and informs the development of future self-play agents.

Abstract

During the development of AlphaGo, its many hyper-parameters were tuned with Bayesian optimization multiple times. This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent and this improved its win-rate from 50% to 66.5% in self-play games. This tuned version was deployed in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage. It is our hope that this brief case study will be of interest to Go fans, and also provide Bayesian optimization practitioners with some insights and inspiration.
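To put the quoted win-rates on a more familiar scale, the sketch below converts a self-play win-rate into an Elo difference using the standard logistic Elo model, and estimates the sampling noise of a win-rate measured from a finite number of games. The function names are hypothetical, and treating games as independent Bernoulli trials is an assumption made for illustration, not a description of how the authors computed their figures.

```python
import math


def elo_difference(win_rate: float) -> float:
    """Elo gap implied by a win-rate under the standard logistic Elo model.

    Assumption: win probability p = 1 / (1 + 10 ** (-d / 400)),
    so d = 400 * log10(p / (1 - p)).
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must lie strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def win_rate_std_error(win_rate: float, n_games: int) -> float:
    """Binomial standard error of a win-rate estimated from n_games games.

    Assumption: games are independent and draws are ignored.
    """
    if n_games <= 0:
        raise ValueError("n_games must be positive")
    return math.sqrt(win_rate * (1.0 - win_rate) / n_games)


if __name__ == "__main__":
    # Win-rate quoted in the abstract for the tuned agent.
    print(f"Elo difference at 66.5%: {elo_difference(0.665):.1f}")
    # Noise of a single 50-game evaluation near a 50% win-rate.
    print(f"Std. error, 50 games: {win_rate_std_error(0.5, 50):.3f}")
```

Under these assumptions a 66.5% win-rate corresponds to roughly 119 Elo, while a single 50-game evaluation near 50% has a standard error of about 7 percentage points, which is why a noise-aware model of the win-rate is useful.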

Paper Structure

This paper contains 10 sections, 6 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: One-dimensional illustration of Bayesian optimization with Gaussian processes (GPs) and the expected improvement (EI) acquisition function, over the first 6 iterations. The top plots show the GP mean in blue and the true unknown function in red. In the vicinity of query points, the uncertainty is reduced. The bottom plots show the EI acquisition function and its proposed next query points. EI trades off exploitation and exploration.
  • Figure 2: Leftmost three plots: estimated posterior mean and variance of the win-rate for three individual hyper-parameters while fixing the remaining hyper-parameters. The vertical bar shows the fixed reference parameter value. Rightmost plot: posterior mean for two hyper-parameters, showing the correlation among these.
  • Figure 3: Typical values of the observed and maximum expected win-rates as a function of the optimization steps.
  • Figure 4: Tuning the dynamic mixing ratio formula of AlphaGo.
  • Figure 5: Optimized time control formula for AlphaGo with a 2-hour main play time and 60-second byoyomi.
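The expected improvement acquisition named in the Figure 1 caption has a well-known closed form when the posterior at a candidate point is Gaussian. The following is a minimal sketch of that textbook formula for a maximization problem. The function name, the `xi` exploration margin, and the example numbers are assumptions for illustration; this is not the Spearmint implementation used by the authors.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """Closed-form EI for maximization, given a Gaussian posterior N(mu, sigma^2).

    mu, sigma: posterior mean and standard deviation at the candidate point.
    best: best objective value observed so far.
    xi: optional exploration margin (hypothetical parameter, default 0).
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return improvement * cdf + sigma * pdf


if __name__ == "__main__":
    best_win_rate = 0.55
    # Two hypothetical candidates: one with a slightly higher mean and low
    # uncertainty, one with a lower mean but high uncertainty.
    print(expected_improvement(mu=0.56, sigma=0.01, best=best_win_rate))
    print(expected_improvement(mu=0.52, sigma=0.10, best=best_win_rate))
```

The first term rewards candidates whose mean already exceeds the incumbent (exploitation), and the second rewards candidates with large posterior uncertainty (exploration), which is the trade-off the caption refers to.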