MagNet: a Two-Pronged Defense against Adversarial Examples

Dongyu Meng, Hao Chen

TL;DR

Adversarial examples expose vulnerabilities in neural networks, motivating defenses that do not rely on specific attack generation. MagNet offers a two-pronged, attack-agnostic framework comprising detectors that estimate distance from the normal-data manifold and a reformer that moves inputs toward the manifold via autoencoders, with a cryptography-inspired diversity mechanism to defend against graybox attacks. The approach demonstrates strong robustness on MNIST and CIFAR-10 against a range of state-of-the-art attacks (FGSM, DeepFool, Carlini) while preserving most of the original classifier accuracy, and provides a formal threat-model discussion and evaluation. Overall, MagNet represents a practical step toward transferable, input-driven defenses against adversarial perturbations.

Abstract

Deep learning has shown promising results on hard perceptual problems in recent years. However, deep learning systems have been found to be vulnerable to small adversarial perturbations that are nearly imperceptible to humans. Such specially crafted perturbations cause deep learning systems to output incorrect decisions, with potentially disastrous consequences. These vulnerabilities hinder the deployment of deep learning systems where safety or security is important. Attempts to secure deep learning systems either target specific attacks or have been shown to be ineffective. In this paper, we propose MagNet, a framework for defending neural network classifiers against adversarial examples. MagNet neither modifies the protected classifier nor requires knowledge of the process for generating adversarial examples. MagNet includes one or more separate detector networks and a reformer network. Unlike previous work, MagNet learns to differentiate between normal and adversarial examples by approximating the manifold of normal examples. Since it does not rely on any process for generating adversarial examples, it has substantial generalization power. Moreover, MagNet reconstructs adversarial examples by moving them towards the manifold, which helps the classifier correctly classify adversarial examples with small perturbations. We discuss the intrinsic difficulty in defending against whitebox attacks and propose a mechanism to defend against graybox attacks. Inspired by the use of randomness in cryptography, we propose to use diversity to strengthen MagNet. We show empirically that MagNet is effective against the most advanced state-of-the-art attacks in blackbox and graybox scenarios while keeping the false positive rate on normal examples very low.

Paper Structure

This paper contains 41 sections, 14 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An illustration of the reformer's effect on adversarial perturbations. The second row displays adversarial examples generated from the original normal examples in the first row by Carlini's $L^\infty$ attack. The third row shows their perturbations against the original examples, and these perturbations lack prominent patterns. The fourth row displays the adversarial examples after being reformed by MagNet. The fifth row displays the remaining perturbations in the reformed examples against their original examples in the first row, and these perturbations have the shapes of their original examples.
  • Figure 2: MagNet workflow in test phase. MagNet includes one or more detectors. It considers a test example $x$ adversarial if any detector considers $x$ adversarial. If $x$ is not considered adversarial, MagNet reforms it before feeding it to the target classifier.
  • Figure 3: Illustration of how detector and reformer work in a 2-D sample space. We represent the manifold of normal examples by a curve, and depict normal and adversarial examples by green dots and red crosses, respectively. We depict the transformation by autoencoder using arrows. The detector measures reconstruction error and rejects examples with large reconstruction errors (e.g. cross (3) in the figure), and the reformer finds an example near the manifold that approximates the original example (e.g. cross (1) in the figure).
  • Figure 4: Defense performance at different confidence levels of Carlini's $L^2$ attack on the MNIST dataset. The performance is measured as the percentage of adversarial examples that are either detected by the detector or classified correctly by the classifier.
  • Figure 5: Defense performance at different confidence levels of Carlini's $L^2$ attack on the CIFAR-10 dataset. The performance is measured as the percentage of adversarial examples that are either detected by the detector or classified correctly by the classifier.
  • ...and 1 more figure
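
The test-phase workflow in Figure 2, the reconstruction-error detector in Figure 3, and the performance measure in Figures 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`reconstruction_error`, `magnet_predict`, `defense_performance`), the thresholds, and the toy autoencoder and classifier are all hypothetical, and only a reconstruction-error detector is shown.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Autoencoder = Callable[[Vector], Vector]


def reconstruction_error(x: Vector, autoencoder: Autoencoder, p: int = 2) -> float:
    """Mean p-th-power distance between x and its reconstruction."""
    recon = autoencoder(x)
    return sum(abs(a - b) ** p for a, b in zip(x, recon)) / len(x)


def magnet_predict(
    x: Vector,
    detectors: Sequence[Autoencoder],
    thresholds: Sequence[float],
    reformer: Autoencoder,
    classifier: Callable[[Vector], int],
) -> Optional[int]:
    """Return None if any detector rejects x; otherwise classify the reformed x."""
    for detector, threshold in zip(detectors, thresholds):
        if reconstruction_error(x, detector) > threshold:
            return None  # x is considered adversarial and is rejected
    return classifier(reformer(x))


def defense_performance(
    adv_examples: Sequence[Vector],
    true_labels: Sequence[int],
    predict: Callable[[Vector], Optional[int]],
) -> float:
    """Fraction of adversarial examples either detected or classified correctly."""
    handled = 0
    for x, y in zip(adv_examples, true_labels):
        pred = predict(x)
        if pred is None or pred == y:
            handled += 1
    return handled / len(adv_examples)


# Toy stand-ins (assumptions for illustration only): the "manifold" is the
# unit box, so the autoencoder simply clips each coordinate into [0, 1].
def toy_autoencoder(x: Vector) -> Vector:
    return [min(1.0, max(0.0, v)) for v in x]


def toy_classifier(x: Vector) -> int:
    return 1 if sum(x) / len(x) > 0.5 else 0


def toy_predict(x: Vector) -> Optional[int]:
    return magnet_predict(x, [toy_autoencoder], [0.1], toy_autoencoder, toy_classifier)
```

In this toy setup, an input far from the unit box is rejected by the detector, while a slightly perturbed input passes detection and is pulled back onto the box by the reformer before classification. In a real deployment the detectors and the reformer would be trained autoencoders, and the thresholds would be chosen from normal validation data.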

Theorems & Definitions (5)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5