Accelerating Self-Play Learning in Go
David J. Wu
TL;DR
KataGo is a data-efficient Go self-play learner that combines general algorithmic improvements with Go-specific features. The general techniques are playout cap randomization, policy target pruning, global pooling, and auxiliary policy targets; the domain-specific additions are ownership and score targets and tailored input features. Empirically, KataGo needs roughly 50x less compute than ELF OpenGo and outperforms Leela Zero at similar network sizes. The further gains from domain knowledge show how large an efficiency gap remains between purely general AlphaZero-like methods and domain-informed ones. The work is a step toward self-play learning in large state spaces with limited resources, and its general techniques may transfer to other reinforcement learning problems.
Abstract
By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50x reduction in computation over comparable methods. Like AlphaZero and replications such as ELF OpenGo and Leela Zero, our bot KataGo only learns from neural-net-guided Monte Carlo tree search self-play. But whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF's final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero. Our work is a step towards making learning in state spaces as large as Go possible without large-scale computational resources.
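As a rough sanity check on the scale of the claimed savings, the figures quoted above can be turned into GPU-days. This is only a back-of-the-envelope sketch, not the way the 50x figure is derived in the paper: the abstract says "thousands of GPUs" without giving a count, so `ELF_GPUS = 2000` is an assumption, "two weeks" is taken as 14 days, and "fewer than 30 GPUs" is rounded up to 30. It also ignores any difference in GPU type or utilization.

```python
# Back-of-the-envelope GPU-day comparison from the figures in the abstract.
# ELF_GPUS is an ASSUMPTION: the abstract only says "thousands of GPUs".
ELF_GPUS = 2000
ELF_DAYS = 14        # "two weeks"
KATAGO_GPUS = 30     # upper bound implied by "fewer than 30 GPUs"
KATAGO_DAYS = 19


def gpu_days(gpus: int, days: int) -> int:
    """Total GPU-days, assuming all GPUs run for the whole period."""
    return gpus * days


elf_gpu_days = gpu_days(ELF_GPUS, ELF_DAYS)
katago_gpu_days = gpu_days(KATAGO_GPUS, KATAGO_DAYS)
ratio = elf_gpu_days / katago_gpu_days

print(f"ELF:    {elf_gpu_days} GPU-days (assumed GPU count)")
print(f"KataGo: {katago_gpu_days} GPU-days (upper bound)")
print(f"Ratio:  about {ratio:.0f}x")
```

Under these assumptions the ratio comes out a little under 50, which is in line with the order of magnitude stated in the abstract. A different assumed GPU count for ELF changes the result proportionally.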
