Exploring the Space of Adversarial Images

Pedro Tabacof, Eduardo Valle

TL;DR

This paper formalizes adversarial image generation for pretrained classifiers as a constrained optimization problem, revealing nonconvexity even in linear models. It analyzes the pixel-space geometry by perturbing images with random noise and evaluating label stability across MNIST (logistic and ConvNet) and ImageNet (OverFeat), showing that adversarial regions are large and not isolated. The study finds that shallow classifiers can be more robust than deep networks on the same task and highlights the influence of perturbation distribution, especially heavier-tailed noise, on robustness. The results challenge simple linearity explanations for adversarial vulnerability and point to complex spatial structures that warrant further defense-oriented research.

Abstract

Adversarial examples have raised questions regarding the robustness and security of deep neural networks. In this work we formalize the problem of adversarial images given a pretrained classifier, showing that even in the linear case the resulting optimization problem is nonconvex. We generate adversarial images using shallow and deep classifiers on the MNIST and ImageNet datasets. We probe the pixel space of adversarial images using noise of varying intensity and distribution. We bring novel visualizations that showcase the phenomenon and its high variability. We show that adversarial images appear in large regions in the pixel space, but that, for the same task, a shallow classifier seems more robust to adversarial images than a deep convolutional network.

Paper Structure

This paper contains 9 sections, 2 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Fixed-sized images occupy a high-dimensional space spanned by their pixels (one pixel = one dimension), here depicted as a 2D colormap. Left: classifiers associate points of the input pixel space to output class labels, here 'banana' (blue) and 'mushroom' (red). From a correctly classified original image (a), an optimization procedure (dashed arrows) can find adversarial examples that are, for humans, essentially equal to the original, but that will fool the classifier. Right: we probe the pixel space by taking a departing image (white diamond), adding random noise to it (black stars), and asking the classifier for the label. In compact, stable regions, the classifier will be consistent (even if wrong). In isolated, unstable regions, as depicted, the classifier will be erratic.
  • Figure 2: Adversarial examples for each network. For all experiments: original images on the top row, adversarial images on the bottom row, distortions (difference between original and adversarial images) on the middle row.
  • Figure 3: Adding Gaussian noise to the images. We perform the probing procedure explained in the methods section to measure the stability of the classifier boundaries at different points of the pixel space. To escape the adversarial pockets completely, we have to add noise considerably stronger than the original distortion used to reach them in the first place: adversarial regions are not isolated. That is especially true for ImageNet/OverFeat. Still, the region around the correctly classified original image is much more stable. This graph is heavily averaged: each stacked column along the horizontal axis summarizes 125 experiments $\times$ 100 random probes.
  • Figure 4: Adding Gaussian noise to the images. Another view of the probing procedure explained in the methods section. In contrast to the averaged view of Figure 3, here each of the 125 experiments appears as an independent curve along the Experiments axis (their order is arbitrary, chosen to reduce occlusions). Each point of a curve is the fraction of probes (out of the hundred performed) that keep their class label.
  • Figure 5: For each of the 125 experiments we measure the fraction of the probe images (i.e., departing image + random noise) that stayed in the same class. Those fractions are then sorted from largest to smallest along the Experiments axis. The area under each curve indicates the overall fraction of probes, across all experiments, that stayed in the same class.
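
The probing procedure described in the captions of Figures 1, 3, 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `probe_stability`, `sort_fractions` and `toy_classifier` are hypothetical, the classifier is a stand-in for a real network, and clipping pixels to the [0, 1] range is an assumption made for the example.

```python
import numpy as np


def probe_stability(classify, image, sigma, n_probes=100, rng=None):
    """Return the fraction of noisy probes that keep the departing image's label.

    classify: callable mapping an image array to a class label (hypothetical).
    image:    departing image, assumed here to have pixel values in [0, 1].
    sigma:    standard deviation of the Gaussian noise added to each pixel.
    """
    rng = np.random.default_rng() if rng is None else rng
    base_label = classify(image)
    kept = 0
    for _ in range(n_probes):
        noise = rng.normal(0.0, sigma, size=image.shape)
        # Assumption: keep probes inside the valid pixel range.
        probe = np.clip(image + noise, 0.0, 1.0)
        kept += int(classify(probe) == base_label)
    return kept / n_probes


def sort_fractions(fractions):
    """Sort per-experiment fractions from largest to smallest (as in Figure 5)."""
    return np.sort(np.asarray(fractions, dtype=float))[::-1]


def toy_classifier(image):
    """Stand-in classifier: label 1 if the mean pixel exceeds 0.5, else 0."""
    return int(image.mean() > 0.5)
```

Calling `probe_stability(toy_classifier, image, sigma)` for a range of `sigma` values, once from an original image and once from its adversarial counterpart, gives one curve of the kind plotted in Figure 4. The mean of the values returned by `sort_fractions` corresponds to the normalized area under a Figure 5 curve.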