Adversarial Policies Beat Superhuman Go AIs

Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

TL;DR

This work shows that even superhuman Go AIs such as KataGo are vulnerable to adversarial policies, which win by inducing decisive blunders rather than by playing strong Go. Using Adversarial MCTS (A-MCTS), the authors train two adversaries (the pass-adversary and the cyclic-adversary) against a fixed KataGo victim, achieve a win rate above 97% against KataGo at superhuman settings with limited compute, and demonstrate zero-shot transfer to other superhuman Go AIs. Adversarial training offers only partial robustness: a defense can be circumvented by fine-tuning, and the cyclic vulnerability persists under substantial search budgets, which highlights the need for robust, multi-agent, and defense-oriented approaches. For AI safety, the results show that capability does not automatically translate into robustness and that adversarial evaluation is needed alongside capability benchmarks.

Abstract

We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.

Paper Structure (77 sections, 7 equations, 62 figures, 8 tables)

Figures (62)

  • Figure 1.1: Games between the strongest KataGo network at the time of conducting this research (which we refer to as Latest) and two different types of adversaries we trained. (a) Our cyclic-adversary beats KataGo even when KataGo plays with far more search than is needed to be superhuman. The adversary lures the victim into letting a large group of cyclic victim stones ($\mathbf{\times}$) get captured by the adversary's next move ($\Delta$). Appendix \ref{app:kellin-donut-analysis} has a detailed description of this adversary's behavior. (b) Our pass-adversary beats no-search KataGo by tricking it into passing. The adversary then passes in turn, ending the game with the adversary winning under the Tromp-Taylor ruleset for computer Go \cite{tromp:2014} that KataGo was trained and configured to use (see Appendix \ref{app:rules}). The adversary gets points for its territory in the bottom-right corner (devoid of victim stones) whereas the victim does not get points for the territory in the top-left due to the presence of the adversary's stones.
  • Figure 2.1: A human amateur beats our adversarial policy (Appendix \ref{app:experiments:human-vs-adversary}) that beats KataGo. This non-transitivity shows the adversary is not a generally capable policy, and is just exploiting KataGo.
  • Figure 2.1: Black moves next in this game. There is a seki in the bottom left corner of the board. Neither black nor white should play in either square marked with $\Delta$, or else the other player will play in the other square and capture the opponent's stones. If Latest with 128 visits plays as black, it will pass. On the other hand, Latest$_\texttt{def}$ with 128 visits playing as black will play in one of the marked squares and lose its stones.
  • Figure 4.1: MCTS (left) builds a search tree one node at a time. To add a node, it walks down the tree until a new leaf is reached (red arrows). At a node $x$, the next step of the walk is determined by a PUCT \cite{rosinMultiarmedBanditsEpisode2011} algorithm (solid arrows) which takes into account neural network evaluations of each node in the subtree of $x$. A-MCTS-S (middle) walks down the tree by using a modified PUCT algorithm at adversary nodes, and sampling directly from the victim's policy network (dashed arrows) at victim nodes. A-MCTS-R (right) performs a full simulation of the victim as opposed to sampling from the victim's policy net. Search trees are depicted as binary for illustrative purposes only. See Appendix \ref{app:search-algorithms} for full details.
  • Figure 4.1: The compute used for adversary training ($y$-axis) as a function of the number of adversary training steps taken ($x$-axis). The plots here mirror the structure of Figure \ref{fig:evaluation:training-curve-no-search} and Figure \ref{fig:evaluation:training-curve-cyclic}. Top: The compute of the pass-adversary is a linear function of its training steps because the pass-adversary was trained against victims of similar size, all of which used no search (Appendix \ref{app:hyperparameters:no-search}). Bottom: In contrast, the compute of the cyclic-adversary is highly non-linear due to training against a wider range of victim sizes and the exponential ramp up of victim search at the end of its curriculum (Appendix \ref{app:hyperparameters:hardened-curriculum}).
  • ...and 57 more figures
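
The scoring effect described in the caption of Figure 1.1(b) can be illustrated with a minimal sketch of Tromp-Taylor area scoring: each player scores one point per stone on the board plus one point per empty intersection in a region that touches only that player's stones. The function name `tromp_taylor_score`, the board encoding (`X` for the adversary, `O` for the victim, `.` for empty), and the 5x5 toy positions are assumptions made for illustration; they are not taken from the paper or from KataGo, and komi is omitted.

```python
from collections import deque

EMPTY, BLACK, WHITE = ".", "X", "O"  # hypothetical encoding for this sketch


def tromp_taylor_score(board):
    """Return (black_score, white_score) under Tromp-Taylor area scoring.

    `board` is a list of equal-length strings. Komi is not applied.
    """
    rows, cols = len(board), len(board[0])
    score = {BLACK: 0, WHITE: 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:  # a stone counts one point for its owner
                score[cell] += 1
                continue
            if (r, c) in seen:
                continue
            # Flood-fill this empty region and record which colors it touches.
            region_size, borders = 0, set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region_size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < rows and 0 <= nx < cols):
                        continue
                    neighbor = board[ny][nx]
                    if neighbor == EMPTY:
                        if (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                    else:
                        borders.add(neighbor)
            # The region scores only if it reaches exactly one color.
            if len(borders) == 1:
                score[borders.pop()] += region_size
    return score[BLACK], score[WHITE]


# Toy position: White walls off the top-left, but a Black stone sits inside it.
with_intruder = ["..O..", ".XOXX", "OOOX.", "XXXX.", "....."]
# Same position with the intruding Black stone removed.
without_intruder = ["..O..", "..OXX", "OOOX.", "XXXX.", "....."]

print(tromp_taylor_score(with_intruder))     # (15, 5)
print(tromp_taylor_score(without_intruder))  # (14, 9)
```

In the toy position, a single adversary stone inside the victim's enclosed area stops the victim from scoring any of that area, while the adversary's own region, which touches no victim stones, counts in full.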