
Adversarial Transformation Networks: Learning to Generate Adversarial Examples

Shumeet Baluja, Ian Fischer

TL;DR

Adversarial Transformation Networks (ATNs) offer a fast, self-supervised framework for generating targeted adversarial examples: a separate network is trained to transform inputs so that a fixed target model misclassifies them. The approach introduces two ATN variants, Perturbation ATN (P-ATN) and Adversarial Autoencoding ATN (AAE-ATN), and uses a reranking-based loss to enforce the targeted output while preserving the relative order of the other class predictions. Experiments on MNIST and ImageNet (Inception-ResNet-v2) show architecture-dependent trade-offs between perturbation locality and adversarial diversity, along with transfer and insider-information effects. The work suggests that ATNs could be useful for adversarial training, defense research, and understanding how classifiers encode target concepts, and it outlines directions for future enhancements and black-box extensions.

Abstract

Multiple different approaches of generating adversarial examples have been proposed to attack deep neural networks. These approaches involve either directly computing gradients with respect to the image pixels, or directly solving an optimization on the image pixels. In this work, we present a fundamentally new method for generating adversarial examples that is fast to execute and provides exceptional diversity of output. We efficiently train feed-forward neural networks in a self-supervised manner to generate adversarial examples against a target network or set of networks. We call such a network an Adversarial Transformation Network (ATN). ATNs are trained to generate adversarial examples that minimally modify the classifier's outputs given the original input, while constraining the new classification to match an adversarial target class. We present methods to train ATNs and analyze their effectiveness targeting a variety of MNIST classifiers as well as the latest state-of-the-art ImageNet classifier Inception ResNet v2.
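
The training target described above (new top class equal to the adversarial target, everything else changed as little as possible) can be illustrated with a small NumPy sketch. This is only one possible reranking function under stated assumptions: the function name `rerank`, the parameter `alpha`, and the example probability vector are illustrative and are not given in this summary.

```python
import numpy as np

def rerank(y, t, alpha=1.5):
    """Build a reranked training target from a classifier output y.

    Assumption (not stated in this summary): the target class t is set to
    alpha * max(y) with alpha > 1, every other entry keeps its value, and
    the vector is renormalized to sum to 1.
    """
    target = np.array(y, dtype=float)    # copy so the input is not modified
    target[t] = alpha * target.max()     # push class t above the current top class
    return target / target.sum()         # renormalize to a probability vector

# Hypothetical classifier output for an image of a "3".
y = np.array([0.05, 0.05, 0.10, 0.70, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01])
y_target = rerank(y, t=7)

order = np.argsort(-y_target)            # classes sorted from highest to lowest
print("top class:", order[0], "runner-up:", order[1])
```

With this choice, class 7 becomes the highest entry and the original class 3 stays in second place, which matches the behaviour shown in Figures 1 and 3.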

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: (Left) A simple classification network which takes input image $\mathbf{x}$. (Right) With the same input, $\mathbf{x}$, the ATN emits $\mathbf{x}'$, which is fed into the classification network. In the example shown, the input digit is classified correctly as a 3 (on the left). ATN$_7$ takes $\mathbf{x}$ as input and generates a modified image ($3'$) such that the classifier outputs a 7 as the highest activation and the previous highest classification, 3, as the second highest activation (on the right).
  • Figure 2: Successful adversarial examples from ATN$_t$ against Classifier$_p$. The top block uses the highest value, $\beta=0.010$; the bottom two use $\beta=0.005$ and $\beta=0.001$, respectively. As $\beta$ decreases, fidelity to the underlying digit decreases. Within each block, the column gives the correct classification of the image and the row gives the adversarial classification, $t$.
  • Figure 3: Typical transformations made to MNIST digits against Classifier$_p$. Black digits on the white background are output classifications from Classifier$_p$. The bottom classification is the original (correct) classification. The top classification is the result of classifying the adversarial example. White digits on black backgrounds are the MNIST digits and their transformations to adversarial examples. The bottom MNIST digits are unmodified, and the top are adversarial. In all of these images, the adversarial example is classified as $t=\mathop{\mathrm{argmax}} \mathbf{y}'$ while maintaining the second highest output in $\mathbf{y}'$ as the original classification, $\mathop{\mathrm{argmax}} \mathbf{y}$.
  • Figure 4: The ATN now has to fool three networks (of various architectures), while also minimizing $L_{\mathcal{X}}$, the reconstruction error.
  • Figure 5: Do the same transformed examples work well on all the networks? (Top) Percentage of examples that worked on exactly 0-3 training networks. (Bottom) Percentage of examples that worked on exactly 0-2 unseen networks. Note: these are all measured on independent test set images.
  • ...and 4 more figures
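
The captions for Figures 2, 4, and 5 mention a weight $\beta$, a reconstruction error $L_{\mathcal{X}}$, and counts of how many networks each example fools. A minimal sketch of how these pieces could fit together is given below. It assumes a combined objective of the form $\beta L_{\mathcal{X}}$ plus one output-space term per classifier, with squared L2 distance for both terms. The names `atn_loss`, `fooled_histogram`, `rerank`, and `alpha`, and the toy linear classifiers, are hypothetical stand-ins and not the networks used in the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

def rerank(y, t, alpha=1.5):
    # Assumed reranking: raise class t above the current maximum, renormalize.
    target = np.array(y, dtype=float)
    target[t] = alpha * target.max()
    return target / target.sum()

def atn_loss(x, x_adv, classifiers, t, beta=0.01, alpha=1.5):
    """beta * L_X plus the sum of per-classifier output losses (assumed squared L2)."""
    l_x = np.sum((x_adv - x) ** 2)       # reconstruction error in input space
    l_y = sum(np.sum((f(x_adv) - rerank(f(x), t, alpha)) ** 2) for f in classifiers)
    return beta * l_x + l_y

def fooled_histogram(x_advs, classifiers, t):
    """Fraction of examples classified as t by exactly k classifiers, for k = 0..n."""
    counts = np.zeros(len(classifiers) + 1)
    for x_adv in x_advs:
        k = sum(int(np.argmax(f(x_adv)) == t) for f in classifiers)
        counts[k] += 1
    return counts / len(x_advs)

# Toy stand-ins: three random linear-softmax "classifiers" on 784-dim inputs.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.05, size=(10, 784)) for _ in range(3)]
classifiers = [lambda x, W=W: softmax(W @ x) for W in weights]

x = rng.random(784)
x_adv = np.clip(x + 0.05 * rng.normal(size=784), 0.0, 1.0)

print("loss:", atn_loss(x, x_adv, classifiers, t=7))
print("histogram:", fooled_histogram([x, x_adv], classifiers, t=7))
```

Under this formulation, a smaller `beta` puts less weight on staying close to the original image, which is consistent with the loss of fidelity noted in the Figure 2 caption as $\beta$ decreases.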