Can Go AIs be adversarially robust?

Tom Tseng; Euan McLean; Kellin Pelrine; Tony T. Wang; Adam Gleave

TL;DR

This paper investigates whether superhuman Go AIs can be made robust to adversarial attacks by testing three defenses: positional adversarial training, iterated adversarial training, and a Vision Transformer (ViT) backbone. Across a suite of learned attacks, including cyclic and gift strategies, no defense provides full robustness: adaptively trained adversaries still find exploitable weaknesses, and cyclic attacks persist even at high search budgets. The ViT-based Go AI also remains vulnerable to cyclic strategies, suggesting that the robustness deficits are not solely due to CNN inductive biases or training regimes. The results highlight how hard robust AI is to achieve even in a tractable, innately adversarial setting, and call for larger attack corpora, more diverse defenses, and online or multi-agent methods as steps toward practical robustness.

Abstract

Prior work found that superhuman Go AIs can be defeated by simple adversarial strategies, especially "cyclic" attacks. In this paper, we study whether adding natural countermeasures can achieve robustness in Go, a favorable domain for robustness since it benefits from incredible average-case capability and a narrow, innately adversarial setting. We test three defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that though some of these defenses protect against previously discovered attacks, none withstand freshly trained adversaries. Furthermore, most of the reliably effective attacks these adversaries discover are different realizations of the same overall class of cyclic attacks. Our results suggest that building robust AI systems is challenging even with extremely superhuman systems in some of the most tractable settings, and highlight two key gaps: efficient generalization of defenses, and diversity in training. For interactive examples of attacks and a link to our codebase, see https://goattack.far.ai.

Paper Structure (94 sections, 2 equations, 82 figures, 6 tables)

Introduction
Threat Model, Robustness, Attack Method
Threat Model
Defining Robustness
Attack Method
Positional Adversarial Training
Defense Methodology
The Gift Adversary
Cyclic Attacks
Iterated Adversarial Training
Methodology
Results
Robustness Against the Iterated Adversaries
Robustness Against New Adversaries
Vision Transformers
...and 79 more sections

Figures (82)

Figure 1: Three strategies for defending Go AIs against adversarial attack. Left: Positional adversarial training has an agent "study" adversarial positions by performing self-play starting from those positions. Middle: Iterated adversarial training consists of multiple rounds of an adversary finding attacks and a victim learning to defend. Right: We replace KataGo's convolutional neural network (CNN) backbone with a vision transformer (ViT) backbone to see which vulnerabilities of Go AIs are caused by the inductive biases of CNNs.
Figure 2: Win rate (y-axis) of adversaries (legend) for varying amounts of search visits (x-axis) given to victims (plot title). The adversary win rate declines with victim search budget; however, some adversaries generalize better to higher victim visit counts than others. Shaded regions are 95% Clopper-Pearson confidence intervals in this and following figures.
Figure 3: Our learned adversarial strategies are qualitatively distinct. Three panels (fig:largeadv_boardstate, fig:a9example, fig:vitexample) show cyclic attacks with the $\boldsymbol{\times}$ groups soon to be captured; these attacks use different styles of inside shapes, though these shapes have little impact on optimal play and are all easy for a human to navigate correctly. The gift-adversary panel (fig:giftexample) follows a different strategy, inducing the victim (white) to play the stone marked $\boldsymbol{\times}$, "gifting" the adversary two stones it can capture by playing at $\triangle$. Each subcaption links to a complete game history on our demo site.
Figure 4: Win rate of all adversaries ($x$-axis) against all victims ($y$-axis) throughout iterated adversarial training for varying victim visits (plot title). The adversary $\texttt{a}_{n}$ is typically able to beat the victim $\texttt{v}_{n}$ it is trained to exploit (top-left-to-bottom-right diagonal), especially at 16 visits or fewer (middle and left plots). However, given at least 16 visits (middle and right plots), the victim $\texttt{v}_{n}$ is typically able to beat the adversary $\texttt{a}_{n-1}$ it trained against (elements immediately below the main diagonal) along with all previous iterations $\texttt{a}_{n-2}$, $\texttt{a}_{n-3}$, $\cdots$. See the extended figure (fig:win-rate-heatmap-full) for a version including other adversaries, victims, and visit counts.
Figure 5: Win rate ($y$-axis) of base-adversary vs base-victim (---) and atari-adversary vs v9 ($\cdots$) by training compute ($x$-axis), including the 164 GPU-days spent training base-adv-early, the checkpoint from which atari-adversary was initialized. The checkpoint marked $\blacklozenge$ is used for evaluation.
...and 77 more figures
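
The Figure 2 caption notes that shaded regions are 95% Clopper-Pearson confidence intervals. As a rough illustration of what such an interval means for a win rate, the sketch below computes one for a given number of adversary wins out of a number of games. It is not taken from the paper's codebase: the function names and the bisection-based solver are assumptions chosen to keep the example dependency-free, and the win counts in the usage note are made up.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_increasing(g, iters: int = 100) -> float:
    """Bisection on [0, 1] for a function g that increases from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(wins: int, games: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) two-sided interval for a binomial win rate.

    Illustrative helper, intended for modest game counts (up to a few hundred).
    """
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    # Lower bound: the p at which P(X >= wins) equals alpha / 2.
    if wins == 0:
        lower = 0.0
    else:
        lower = _root_of_increasing(
            lambda p: (1 - binom_cdf(wins - 1, games, p)) - alpha / 2
        )
    # Upper bound: the p at which P(X <= wins) equals alpha / 2.
    if wins == games:
        upper = 1.0
    else:
        upper = _root_of_increasing(
            lambda p: alpha / 2 - binom_cdf(wins, games, p)
        )
    return lower, upper
```

For example, `clopper_pearson(30, 100)` returns the interval for a hypothetical adversary that wins 30 of 100 games. The interval is exact in the sense that it is derived from the binomial distribution directly rather than from a normal approximation, which matters when win rates are near 0% or 100%.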
