How adversarial attacks can disrupt seemingly stable accurate classifiers

Oliver J. Sutton, Qinghua Zhou, Ivan Y. Tyukin, Alexander N. Gorban, Alexander Bastounis, Desmond J. Higham

TL;DR

A simple, generic and generalisable framework is introduced for which key behaviours observed in practical systems arise with high probability, notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data.

Abstract

Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed, adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high-dimensional input data. We introduce a simple, generic and generalisable framework for which key behaviours observed in practical systems arise with high probability -- notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks, and robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can hide adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.

Paper Structure

This paper contains 58 sections, 19 theorems, 185 equations, 22 figures, 25 tables.

Key Result

Theorem 1

Let $x \in \mathbb{R}^n$ and let $\Pi$ be a planar decision surface passing distance $\epsilon > 0$ from $x$. Suppose (without loss of generality since the setup is invariant to rigid translations) that $\Pi$ passes through the origin. Suppose that points $z$ are sampled uniformly from a ball of rad
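The setting of Theorem 1 can be explored numerically. The sketch below is an illustration only, not the authors' code: since the statement above is truncated, it assumes the perturbations $z$ are drawn uniformly from a Euclidean ball of radius $\delta$ (a name chosen here) and added to $x$. The helper names `sample_uniform_ball`, `crossing_probability` and `cap_bound` are hypothetical. The comparison value in `cap_bound` is the elementary spherical-cap volume bound, included for reference and not claimed to be the exact bound stated in the paper.

```python
import numpy as np


def sample_uniform_ball(num, n, radius, rng):
    """Draw `num` points uniformly from the Euclidean ball of the given radius in R^n."""
    g = rng.standard_normal((num, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions on the sphere
    r = radius * rng.random(num) ** (1.0 / n)      # radial law with density ~ r^(n-1)
    return g * r[:, None]


def crossing_probability(n, eps, delta, num=10_000, seed=0):
    """Monte Carlo estimate of the chance that x + z lies across the plane.

    Assumption: z is uniform in the ball of radius `delta`, and the plane
    passes at distance `eps` from x.
    """
    rng = np.random.default_rng(seed)
    z = sample_uniform_ball(num, n, delta, rng)
    # Take the unit normal of the plane to be e_1, pointing from x towards it.
    # Then x + z crosses the plane exactly when the first coordinate of z exceeds eps.
    return float(np.mean(z[:, 0] > eps))


def cap_bound(n, eps, delta):
    """Elementary cap-volume bound 0.5 * (1 - eps^2 / delta^2)^(n / 2)."""
    if eps >= delta:
        return 0.0  # the ball of perturbations cannot reach the plane
    return 0.5 * (1.0 - (eps / delta) ** 2) ** (n / 2)
```

Calling `crossing_probability(n, 1.0, 5.0)` for increasing `n` gives a rough picture of how quickly random perturbations five times larger than the margin stop reaching the decision surface as the dimension grows.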

Figures (22)

  • Figure 1: Histograms showing the fraction of images from the 'aeroplane-vs-cat' binary classification problem (from the CIFAR-10 dataset) which were misclassified after either (a) an adversarial attack (as the fraction of ordinarily correctly classified images) or (b) a random perturbation of different sizes (as the fraction of images which were susceptible to adversarial attacks), measured as the maximum absolute change to an individual pixel channel (the $\ell^{\infty}$ norm). For adversarial attacks, this represents the smallest misclassifying attack in the adversarial direction. For the random perturbations, we record the smallest $\ell^{\infty}$ norm among 2000 misclassifying perturbations sampled from the Euclidean ball with radius $5\epsilon$, where $\epsilon$ is the Euclidean norm of the smallest successful adversarial attack found for each image. Examples are shown at the size of their respective perturbation norms. Full details of the experimental results are given in Section \ref{sec:full-experiments}.
  • Figure 2: A data point $x$ and the (locally linear) decision surface of a classifier $f$ (solid line). The point $x$ is susceptible to an adversarial attack of size $\epsilon$, and is randomly perturbed using noise of size $\leq \delta$. These perturbed points are sampled from within the dashed ball.
  • Figure 3: Two unit balls with centres separated by distance $2\epsilon$, and the decision surface of the classifier $f$ (dashed).
  • Figure 4: Comparison of the theoretical bounds in Theorems \ref{thm:accuracy} and \ref{thm:undetectability} against empirical results computed using 10,000 data points sampled from $\mathcal{D}_{\epsilon}$, with $\epsilon = 0.05$, and 10,000 perturbations sampled from $\mathcal{U}(\mathbb{B}^n_{\delta})$ for various values of $\delta$. We see that, even for perturbations 50 times larger than the separation distance between the balls (i.e. $\delta = 2.5$), the probability of randomly sampling a perturbation which changes the classification of a random data point is very small in high dimensions.
  • Figure 5: Different scenarios to which the simple two ball model may be generalised.
  • ...and 17 more figures
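
The two-ball setting described in the captions of Figures 3 and 4 can be sketched numerically. This is a minimal sketch under stated assumptions, not the authors' experiment: $\mathcal{D}_{\epsilon}$ is not defined in this excerpt, so it is assumed here to be an equal mixture of uniform distributions on two unit balls centred at $\pm\epsilon e_1$, with the classifier labelling points by the sign of their first coordinate. The function names `uniform_ball`, `predict` and `two_ball_experiment` are hypothetical, and the default sample size is smaller than the 10,000 used in the caption to keep memory use modest.

```python
import numpy as np


def uniform_ball(num, n, radius, rng):
    """Draw `num` points uniformly from the Euclidean ball of the given radius in R^n."""
    g = rng.standard_normal((num, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions
    return g * (radius * rng.random(num) ** (1.0 / n))[:, None]


def predict(points):
    """Assumed linear classifier: label +1 or -1 by the sign of the first coordinate."""
    return np.where(points[:, 0] >= 0.0, 1, -1)


def two_ball_experiment(n, eps=0.05, delta=2.5, num=2000, seed=0):
    """Estimate (accuracy, flip rate) for the assumed two-ball model.

    accuracy:  fraction of sampled data points classified with their true label.
    flip rate: fraction of data points whose predicted label changes after one
               random perturbation drawn uniformly from the ball of radius delta.
    """
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1, 1], size=num)       # which ball each point comes from
    x = uniform_ball(num, n, 1.0, rng)
    x[:, 0] += labels * eps                      # centres at +eps*e_1 and -eps*e_1
    accuracy = float(np.mean(predict(x) == labels))
    s = uniform_ball(num, n, delta, rng)         # one random perturbation per point
    flip_rate = float(np.mean(predict(x + s) != predict(x)))
    return accuracy, flip_rate
```

Running `two_ball_experiment(n)` over a range of dimensions `n` and radii `delta` gives empirical curves of the kind Figure 4 compares against the theoretical bounds. The values depend strongly on both parameters, so no particular output is claimed here.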

Theorems & Definitions (23)

  • Theorem 1: Random perturbations are inefficient for detecting adversarial instability
  • Definition 2: Smeared Absolute Continuity (SmAC) \cite{gorban2018correction}
  • Definition 3: Single-particle SmAC with bounded density
  • Theorem 4: The classifier is accurate
  • Theorem 5: Susceptible data points are typical
  • Theorem 6: Destabilising perturbations are rare
  • Theorem 7: Gradient-based methods find the optimal adversarial attack
  • Theorem 8: Universality of adversarial attacks
  • Theorem 9: Accuracy of the classifier $f$
  • Corollary 10: Accuracy for SmAC distributions
  • ...and 13 more