Generating Adversarial Examples with Adversarial Networks

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, Dawn Song

TL;DR

AdvGAN introduces a GAN-based framework to generate perceptually realistic adversarial examples efficiently. By training a generator-discriminator pair and integrating a targeted adversarial loss, AdvGAN produces perturbations that fool target models while remaining visually plausible. The approach extends to semi-whitebox and black-box settings, using static and dynamic distillation to substitute inaccessible models. Empirical results on MNIST, CIFAR-10, and ImageNet-scale tasks show high attack success rates, robustness under defenses, and high perceptual quality, highlighting its potential to stress-test defenses and inform adversarial training strategies.

Abstract

Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but producing them efficiently and with high perceptual quality requires further research. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. Once the AdvGAN generator is trained, it can generate adversarial perturbations efficiently for any instance, which could accelerate adversarial training as a defense. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models achieve a high attack success rate under state-of-the-art defenses compared to other attacks. Our attack placed first, with 92.76% accuracy, on a public MNIST black-box attack challenge.
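
As a rough illustration of how a generator objective of this kind can be assembled, the sketch below combines three per-example terms: a targeted adversarial loss on the target (or distilled) model, a GAN term from the discriminator, and a hinge penalty that bounds the perturbation size. This is a minimal sketch, not the authors' implementation. The function name `advgan_generator_loss`, the weights `alpha` and `beta`, the bound `c`, and the choice of cross-entropy as the adversarial loss are all assumptions made for illustration; none of them are specified in the summary above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def advgan_generator_loss(d_fake, target_logits, target_class, perturbation,
                          alpha=1.0, beta=1.0, c=0.3):
    """Illustrative generator-side loss for a single example.

    d_fake        -- discriminator's probability that x + G(x) is a real instance
    target_logits -- logits of the target (or distilled) model on x + G(x)
    target_class  -- class index the adversary wants predicted
    perturbation  -- G(x), flattened to a list of floats
    alpha, beta, c -- hypothetical term weights and perturbation bound
    """
    eps = 1e-12  # guards against log(0)

    # Adversarial term: cross-entropy toward the adversary-selected class
    # (an assumed choice of loss for this sketch).
    probs = softmax(target_logits)
    loss_adv = -math.log(probs[target_class] + eps)

    # GAN term, generator side: log(1 - D(x + G(x))).
    # It decreases as the discriminator is fooled.
    loss_gan = math.log(1.0 - d_fake + eps)

    # Hinge term: penalize the L2 norm of the perturbation only above c.
    l2_norm = math.sqrt(sum(p * p for p in perturbation))
    loss_hinge = max(0.0, l2_norm - c)

    return loss_adv + alpha * loss_gan + beta * loss_hinge


if __name__ == "__main__":
    # Toy values only; no real model is queried here.
    loss = advgan_generator_loss(
        d_fake=0.5,
        target_logits=[0.2, 1.5, -0.3],
        target_class=1,
        perturbation=[0.05, -0.1, 0.02],
    )
    print(f"generator loss: {loss:.4f}")
```

In a real training loop the same three terms would be computed over minibatches with an automatic-differentiation framework, and the discriminator would be updated in alternation with the generator. In the black-box setting described above, `target_logits` would come from the distilled model rather than the inaccessible target.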

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of AdvGAN
  • Figure 2: Adversarial examples generated from the same original image to different targets by AdvGAN on MNIST. Row 1: semi-whitebox attack; Row 2: black-box attack. Left to right: models A, B, and C. On the diagonal, the original images are shown, and the numbers on top denote the targets.
  • Figure 3: Adversarial examples generated by AdvGAN on CIFAR-10 for (a) semi-whitebox attack and (b) black-box attack. An image from each class is perturbed toward each of the other classes. On the diagonal, the original images are shown.
  • Figure 4: Examples from an ImageNet-compatible set; the labels denote the corresponding classification results. Left: original benign images; right: adversarial images generated by AdvGAN against Inception_v3.