How adversarial attacks can disrupt seemingly stable accurate classifiers

Oliver J. Sutton; Qinghua Zhou; Ivan Y. Tyukin; Alexander N. Gorban; Alexander Bastounis; Desmond J. Higham

TL;DR

A simple, generic and generalisable framework is introduced in which key behaviours observed in practical systems arise with high probability, notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks and its robustness to random perturbations of the input data.

Abstract

Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data. We introduce a simple, generic and generalisable framework in which key behaviours observed in practical systems arise with high probability, notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks and its robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.

Paper Structure (58 sections, 19 theorems, 185 equations, 22 figures, 25 tables)

This paper contains 58 sections, 19 theorems, 185 equations, 22 figures, 25 tables.

Introduction
Notation
The paradox of apparent stability demonstrated on standard datasets
The essence of the paradox
Random perturbations are inefficient for detecting adversarial instability
A simple theoretical model captures the essence of the paradox
A generalised theoretical model
Further generalisations
Class separation margins hide adversarial susceptibility
Discussion and relation to prior work
Existence of adversarial examples
Fragility of adversarial examples
Certifying robustness of classifiers to adversarial perturbations
Universal adversarial perturbations
Notions of stability
...and 43 more sections

Key Result

Theorem 1

Let $x \in \mathbb{R}^n$ and let $\Pi$ be a planar decision surface passing distance $\epsilon > 0$ from $x$. Suppose (without loss of generality since the setup is invariant to rigid translations) that $\Pi$ passes through the origin. Suppose that points $z$ are sampled uniformly from a ball of rad
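The setting of Theorem 1 can be explored numerically. The following is a minimal Monte Carlo sketch under stated assumptions, not the authors' code: the plane is taken to be $\{y : y_1 = 0\}$, the point is $x = \epsilon e_1$, and perturbations $z$ are assumed to be drawn uniformly from a ball of radius $\delta$ and added to $x$. The function names, dimensions and sample counts are illustrative choices. The helper `cap_bound` is a standard spherical cap volume estimate included for comparison only; it is not claimed to be the bound stated in the theorem.

```python
import numpy as np


def sample_uniform_ball(rng, num, n, radius):
    """Draw `num` points uniformly from the n-dimensional Euclidean ball of given radius."""
    g = rng.standard_normal((num, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the unit sphere
    radii = radius * rng.random(num) ** (1.0 / n)              # radial law of the uniform ball
    return directions * radii[:, None]


def crossing_fraction(n, eps, delta, num=5000, seed=0):
    """Fraction of random perturbations of x = eps * e_1 that cross the plane {y_1 = 0}."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[0] = eps
    z = sample_uniform_ball(rng, num, n, delta)
    return float(np.mean((x + z)[:, 0] < 0.0))


def cap_bound(n, eps, delta):
    """Standard spherical cap volume bound on the crossing probability (assumed comparison)."""
    if delta <= eps:
        return 0.0  # the perturbation ball cannot reach past the plane
    return 0.5 * (1.0 - (eps / delta) ** 2) ** (n / 2)


if __name__ == "__main__":
    eps, delta = 1.0, 5.0  # perturbation radius five times the distance to the plane
    for n in (2, 20, 200, 500):
        print(n, crossing_fraction(n, eps, delta), cap_bound(n, eps, delta))
```

By contrast, an adversarial perturbation only needs Euclidean norm just above $\epsilon$ in the direction $-e_1$ to cross the plane, whatever the dimension.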

Figures (22)

Figure 1: Histograms showing the fraction of images from the 'aeroplane-vs-cat' binary classification problem (from the CIFAR-10 dataset) which were misclassified after either (a) an adversarial attack (as the fraction of ordinarily correctly classified images) or (b) a random perturbation of different sizes (as the fraction of images which were susceptible to adversarial attacks), measured as the maximum absolute change to an individual pixel channel (the $\ell^{\infty}$ norm). For adversarial attacks, this represents the smallest misclassifying attack in the adversarial direction. For the random perturbations, we record the smallest $\ell^{\infty}$ norm among 2000 misclassifying perturbations sampled from the Euclidean ball with radius $5\epsilon$, where $\epsilon$ is the Euclidean norm of the smallest successful adversarial attack found for each image. Examples are shown at the size of their respective perturbation norms. Full details of the experimental results are given in Section \ref{sec:full-experiments}.
Figure 2: A data point $x$ and the (locally linear) decision surface of a classifier $f$ (solid line). The point $x$ is susceptible to an adversarial attack of size $\epsilon$, and is randomly perturbed by noise of size $\leq \delta$. The perturbed points are sampled from within the dashed ball.
Figure 3: Two unit balls with centres separated by distance $2\epsilon$, and the decision surface of the classifier $f$ (dashed).
Figure 4: Comparison of the theoretical bounds in Theorems \ref{thm:accuracy} and \ref{thm:undetectability} against empirical results computed using 10,000 data points sampled from $\mathcal{D}_{\epsilon}$, with $\epsilon = 0.05$, and 10,000 perturbations sampled from $\mathcal{U}(\mathbb{B}^n_{\delta})$ for various values of $\delta$. We see that, even for perturbations 50 times larger than the separation distance between the balls (i.e. $\delta = 2.5$), the probability of randomly sampling a perturbation which changes the classification of a random data point is very small in high dimensions.
Figure 5: Different scenarios to which the simple two ball model may be generalised.
...and 17 more figures
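The two ball model of Figures 3 and 4 can be sketched in a few lines. This is a rough illustration under assumptions that the captions above do not spell out: data are drawn uniformly from two unit balls centred at $\pm\epsilon e_1$ with equal class probability, the classifier is the sign of the first coordinate, and perturbations are uniform in a ball of radius $\delta$. The dimension `n`, the choice $\delta = 5\epsilon$, the sample count and all function names are illustrative, and the sketch does not attempt to reproduce the figure.

```python
import numpy as np


def sample_uniform_ball(rng, num, n, radius):
    """Draw `num` points uniformly from the n-dimensional Euclidean ball of given radius."""
    g = rng.standard_normal((num, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = radius * rng.random(num) ** (1.0 / n)
    return directions * radii[:, None]


def two_ball_experiment(n=2000, eps=0.05, delta=0.25, num=1000, seed=0):
    """Return (accuracy, fraction susceptible to attacks of norm < 2*eps,
    fraction whose predicted label is flipped by one random perturbation)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1.0, 1.0], size=num)
    x = sample_uniform_ball(rng, num, n, 1.0)
    x[:, 0] += eps * labels                       # ball centres at +/- eps * e_1 (assumed)
    pred = np.where(x[:, 0] >= 0.0, 1.0, -1.0)    # assumed classifier: sign of x_1
    accuracy = float(np.mean(pred == labels))

    # The smallest label-changing attack moves a point straight to the plane,
    # so its Euclidean norm is (just above) |x_1|.
    susceptible = float(np.mean(np.abs(x[:, 0]) < 2 * eps))

    # One random perturbation per data point, uniform in the ball of radius delta.
    z = sample_uniform_ball(rng, num, n, delta)
    pred_noisy = np.where(x[:, 0] + z[:, 0] >= 0.0, 1.0, -1.0)
    flipped = float(np.mean(pred_noisy != pred))
    return accuracy, susceptible, flipped


if __name__ == "__main__":
    acc, sus, flip = two_ball_experiment()
    print(f"accuracy={acc:.3f} susceptible={sus:.3f} flipped_by_noise={flip:.3f}")
```

With these illustrative settings, most points lie within $2\epsilon$ of the decision surface, yet random perturbations of norm up to $5\epsilon$ rarely change the predicted label.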

Theorems & Definitions (23)

Theorem 1: Random perturbations are inefficient for detecting adversarial instability
Definition 2: Smeared Absolute Continuity (SmAC) \cite{gorban2018correction}
Definition 3: Single-particle SmAC with bounded density
Theorem 4: The classifier is accurate
Theorem 5: Susceptible data points are typical
Theorem 6: Destabilising perturbations are rare
Theorem 7: Gradient-based methods find the optimal adversarial attack
Theorem 8: Universality of adversarial attacks
Theorem 9: Accuracy of the classifier $f$
Corollary 10: Accuracy for SmAC distributions
...and 13 more
