Accelerating Self-Play Learning in Go

David J. Wu

TL;DR

KataGo delivers a data-efficient Go self-play learner by combining general algorithmic enhancements with Go-specific features. It introduces playout cap randomization, policy target pruning, global pooling, and auxiliary targets to improve learning efficiency, supplemented by domain-specific ownership/score targets and tailored input features. Empirical results show roughly 50x compute savings over ELF and stronger performance than Leela Zero at similar network sizes, highlighting a substantial efficiency gap between purely general AlphaZero-like methods and domain-informed approaches. The work demonstrates practical advancements toward scalable self-play in large state spaces with limited resources and suggests broad applicability to other reinforcement learning problems.

Abstract

By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50x reduction in computation over comparable methods. Like AlphaZero and replications such as ELF OpenGo and Leela Zero, our bot KataGo only learns from neural-net-guided Monte Carlo tree search self-play. But whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF's final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero. Our work is a step towards making learning in state spaces as large as Go possible without large-scale computational resources.

Paper Structure

This paper contains 26 sections, 10 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Log policy of 10-block nets, white to play. Left: trained with forced playouts and policy target pruning. Right: trained without. Dark/red through bright green ranges from about $p=\text{2e-4}$ to $p=1$. Pruning reduces the policy mass on many bad moves near the edges.
  • Figure 2: Global pooling bias structure, globally aggregating values of one set of channels to bias another set of channels.
  • Figure 3: Visualization of ownership predictions by the trained neural net.
  • Figure 4: 1600-visit Elo progression of KataGo (blue, leftmost) vs. Leela Zero (red, center) and ELF (green diamond). X-axis: self-play cost in billions of equivalent 20 block x 256 channel queries. Note the log-scale. Leela Zero's costs are highly approximate.
  • Figure 5: KataGo's main run versus Fixed runs. X-axis is the cumulative self-play cost in millions of equivalent 20 block x 256 channel queries.
  • ...and 3 more figures
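
The Figure 2 caption describes the global pooling bias structure only in words, so a minimal sketch may help. It assumes channels are stored as nested lists of board planes, that pooling takes only the mean and max of each channel, and that a plain linear map turns the pooled vector into one bias per output channel. The names `global_pool`, `global_pooling_bias`, `weights`, and `offsets` are hypothetical and chosen for illustration; the paper's actual layer may pool differently and is not reproduced here.

```python
def global_pool(channels):
    """Return [mean, max] for each channel, taken over all board positions."""
    pooled = []
    for plane in channels:
        flat = [v for row in plane for v in row]
        pooled.append(sum(flat) / len(flat))  # global mean of this channel
        pooled.append(max(flat))              # global max of this channel
    return pooled


def global_pooling_bias(x, g, weights, offsets):
    """Add a per-channel bias to x, computed from globally pooled g.

    x: channels to be biased, shape [Cx][H][W]
    g: channels to be pooled, shape [Cg][H][W]
    weights: assumed linear map, shape [Cx][2 * Cg]
    offsets: assumed constant term per x channel, length Cx
    """
    pooled = global_pool(g)
    out = []
    for c, plane in enumerate(x):
        # One scalar bias per channel, broadcast over every board position.
        bias = offsets[c] + sum(w * p for w, p in zip(weights[c], pooled))
        out.append([[v + bias for v in row] for row in plane])
    return out


# Tiny 2x2 "board": two channels to bias, one channel to pool.
x = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
g = [[[1.0, 3.0], [2.0, 2.0]]]           # mean 2.0, max 3.0
weights = [[1.0, 0.0], [0.0, 1.0]]       # channel 0 reads the mean, channel 1 the max
offsets = [0.0, 0.5]
out = global_pooling_bias(x, g, weights, offsets)
```

In this toy case every position of the first channel is shifted by the pooled mean and every position of the second by the pooled max plus its offset, which shows how information from anywhere on the board can reach every location in a single layer.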