Adversarial and Clean Data Are Not Twins

Zhitao Gong, Wenlu Wang, Wei-Shinn Ku

TL;DR

The paper addresses the threat of adversarial examples by showing that a simple binary classifier can separate adversarial from clean data with over 99% accuracy and remains robust to a second-round attack, although, like other current defenses, it generalizes poorly across attack types. It groups adversarial crafting methods into model-independent and model-dependent approaches and details L-BFGS-based and gradient-based attacks (FGSM, TGSM, JSMA). Experiments on MNIST, CIFAR-10, and SVHN show strong practical performance but also sensitivity to perturbation scale and attack algorithm, suggesting that adversarial and clean data come from distinct distributions. The work presents the classifier as a practical preprocessing defense, frames the generalization limitation as common to current defenses, and motivates future work on the space disparity among adversarial crafting methods.

Abstract

Adversarial attacks have cast a shadow on the massive success of deep neural networks. Despite being almost visually identical to the clean data, adversarial images can fool deep neural networks into wrong predictions with very high confidence. In this paper, however, we show that we can build a simple binary classifier that separates adversarial from clean data with accuracy over 99%. We also empirically show that the binary classifier is robust to a second-round adversarial attack. In other words, it is difficult to disguise adversarial samples to bypass the binary classifier. Furthermore, we empirically investigate the generalization limitation that lingers in all current defensive methods, including the binary classifier approach. We hypothesize that this is the result of an intrinsic property of adversarial crafting algorithms.

Paper Structure

This paper contains 9 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The adversarial images (second row) are generated from the first row via iterative FGSM. The label of each image is shown below it, with the prediction probability in parentheses. Our model achieves less than 1% error rate on the clean data.
  • Figure 2: Adversarial training [huang2015-learning, kurakin2016-adversarial-1] does not work. This is a church window plot [warde-farley2016-adversarial]. Each pixel $(i, j)$ (row index and column index pair) represents a data point $\tilde{x}$ in the input space, with $\tilde{x} = x + \vb{h}\epsilon_j + \vb{v}\epsilon_i$, where $\vb{h}$ is the direction computed by FGSM and $\vb{v}$ is a random direction orthogonal to $\vb{h}$. $\epsilon$ ranges over $[-0.5, 0.5]$, and $\epsilon_{(\cdot)}$ is an interpolated value within that interval. The central black dot represents the original data point $x$, the orange dot (to the right of the center dot) represents the last adversarial sample created from $x$ via FGSM that is used in adversarial training, and the blue dot represents a random adversarial sample created from $x$ that cannot be recognized with adversarial training. The three digits below each image, from left to right, are the data samples that correspond to the black dot, orange dot, and blue dot, respectively. Separate markers in the plot distinguish four kinds of points: data samples that are always correctly recognized by the model; data samples that are always incorrectly recognized; adversarial samples that are correctly recognized only without adversarial training; and data points that are correctly recognized only with adversarial training, i.e., the side effect of adversarial training.
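
The grid of points behind a church window plot, as described in the Figure 2 caption, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `church_window_grid` and its parameters are hypothetical, the random direction is made orthogonal to $\vb{h}$ with a single Gram-Schmidt step, and it is rescaled to the norm of $\vb{h}$ (the caption does not say how $\vb{v}$ is scaled). In practice $\vb{h}$ would be the FGSM direction obtained from the model's loss gradient; here it is passed in as a plain array.

```python
import numpy as np


def church_window_grid(x, h, n=5, eps_max=0.5, seed=0):
    """Return an (n, n, *x.shape) grid with grid[i, j] = x + h*eps[j] + v*eps[i].

    x: original data point; h: FGSM direction (same shape as x, nonzero).
    v is a random direction orthogonal to h (hypothetical construction).
    """
    rng = np.random.default_rng(seed)
    x_flat = np.asarray(x, dtype=float).ravel()
    h_flat = np.asarray(h, dtype=float).ravel()

    # Draw a random direction and remove its component along h (Gram-Schmidt).
    r = rng.standard_normal(h_flat.shape)
    v_flat = r - (r @ h_flat) / (h_flat @ h_flat) * h_flat
    # Assumption: give v the same norm as h so both axes use a comparable scale.
    v_flat *= np.linalg.norm(h_flat) / np.linalg.norm(v_flat)

    # Interpolated epsilon values across [-eps_max, eps_max].
    eps = np.linspace(-eps_max, eps_max, n)

    # Columns (index j) move along h, rows (index i) move along v.
    grid = (
        x_flat[None, None, :]
        + eps[None, :, None] * h_flat
        + eps[:, None, None] * v_flat
    )
    return grid.reshape((n, n) + np.shape(x)), v_flat.reshape(np.shape(x))
```

Each `grid[i, j]` would then be passed to the classifier, and the pixel at `(i, j)` colored according to the prediction. With an odd `n`, the center entry `grid[n // 2, n // 2]` is the original point $x$.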