The Limitations of Deep Learning in Adversarial Settings

Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, Ananthram Swami

TL;DR

The paper investigates the vulnerability of deep neural networks (DNNs) to adversarial inputs. It formalizes the space of adversaries and introduces adversarial saliency maps, built from the forward derivative, to craft targeted misclassifications. The attack follows a general three-step framework: compute the forward derivative (Jacobian) of the DNN, build an adversarial saliency map, and iteratively perturb input features until the network produces a chosen output $\mathbf{Y}^*$ or the distortion constraint $\Upsilon$ is reached. On MNIST, the method achieves a $97.1\%$ success rate across all source-target class pairs while modifying about $4\%$ of input features on average, and the perturbations remain largely imperceptible to humans at moderate distortions. The work also introduces two quantitative, defense-oriented tools, class-pair hardness and adversarial distance, to assess robustness and guide defense design, supported by human-subject perception studies. The results highlight practical security implications for deep learning systems and outline directions for adversarial training and detection to mitigate such attacks in real-world settings.

Abstract

Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.
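The three-step procedure summarized in the TL;DR (forward derivative, saliency map, iterative perturbation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the model is a toy linear softmax classifier rather than a DNN, the weights `W` and `b` are hand-picked, and the helper names (`forward_derivative`, `saliency_map`, `craft_adversarial`) are hypothetical. The sketch perturbs one feature per step, a simplification of the procedure the paper describes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def model(W, b, x):
    """Toy stand-in for a DNN: F(x) = softmax(W x + b)."""
    return softmax(W @ x + b)

def forward_derivative(W, b, x):
    """Jacobian J[j, i] = dF_j/dx_i of the toy model at x."""
    p = model(W, b, x)
    # d softmax / d logits = diag(p) - p p^T, chained with d logits / dx = W
    return (np.diag(p) - np.outer(p, p)) @ W

def saliency_map(J, target):
    """Score features whose increase raises the target class and lowers the rest."""
    dt = J[target]                       # dF_t/dx_i
    do = J.sum(axis=0) - dt              # sum over j != t of dF_j/dx_i
    s = dt * np.abs(do)
    s[(dt < 0) | (do > 0)] = 0.0         # reject features that work against the target
    return s

def craft_adversarial(W, b, x, target, theta=1.0, upsilon=0.5):
    """Increase the most salient feature until the target class is predicted
    or more than a fraction `upsilon` of the features would be modified."""
    x_adv = x.astype(float).copy()
    modified = set()
    max_features = int(np.floor(upsilon * x.size))
    while (np.argmax(model(W, b, x_adv)) != target
           and len(modified) < max_features):
        s = saliency_map(forward_derivative(W, b, x_adv), target)
        s[x_adv >= 1.0] = 0.0            # saturated features leave the search domain
        i = int(np.argmax(s))
        if s[i] <= 0.0:                  # no remaining feature helps
            break
        x_adv[i] = min(1.0, x_adv[i] + theta)
        modified.add(i)
    success = int(np.argmax(model(W, b, x_adv))) == target
    return x_adv, success, len(modified) / x.size

# Hypothetical 3-class, 4-feature example (weights chosen by hand).
W = np.array([[3.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0]])
b = np.zeros(3)
x = np.array([1.0, 0.0, 0.0, 0.0])      # classified as class 0 by the toy model
x_adv, success, fraction = craft_adversarial(W, b, x, target=2)
print(x_adv, success, fraction)
```

The returned fraction of modified features plays the role of the distortion that the paper reports as a percentage of input features; the stopping rule on `upsilon` is one plausible reading of the distortion constraint $\Upsilon$, not a reproduction of the authors' code.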

Paper Structure

This paper contains 26 sections, 16 equations, 17 figures, and 3 algorithms.

Figures (17)

  • Figure 1: Adversarial sample generation: distortion is added to input samples to force the DNN to output an adversary-selected classification (min distortion $= 0.26\%$, max distortion $= 13.78\%$, and average distortion $\varepsilon=4.06\%$).
  • Figure 2: Threat Model Taxonomy: our class of algorithms operates in the threat model indicated by a star.
  • Figure 3: Simplified Multi-Layer Perceptron architecture with input $\mathbf{X}=(x_1,x_2)$, hidden layer $(h_1,h_2)$, and output $o$.
  • Figure 4: The output surface of our simplified Multi-Layer Perceptron for the input domain $[0,1]^2$. Blue corresponds to a $0$ output while yellow corresponds to a $1$ output.
  • Figure 5: Forward derivative of our simplified multi-layer perceptron according to input neuron $x_2$. Sample $\mathbf{X}$ is benign and $\mathbf{X}^*$ is adversarial, crafted by adding $\delta_\mathbf{X}=(0,\delta x_2)$.
  • ...and 12 more figures
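
The setup described in Figures 3 to 5 (a two-input perceptron with hidden layer $(h_1,h_2)$, scalar output $o$, and its forward derivative with respect to $x_2$) can be sketched as follows. The weights below are hypothetical and hand-picked so that the output is close to $1$ only when both inputs are near $1$; they are not the paper's trained parameters, and the sigmoid activations and function names are likewise assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters (not taken from the paper).
W1 = np.array([[10.0, 10.0],             # weights into h1
               [8.0, 8.0]])              # weights into h2
b1 = np.array([-15.0, -12.0])
w2 = np.array([6.0, 6.0])                # weights from (h1, h2) into o
b2 = -6.0

def mlp(x):
    """Output o for input X = (x1, x2)."""
    h = sigmoid(W1 @ x + b1)
    return sigmoid(w2 @ h + b2)

def forward_derivative_mlp(x):
    """Gradient (do/dx1, do/dx2), obtained with the chain rule layer by layer."""
    h = sigmoid(W1 @ x + b1)
    o = sigmoid(w2 @ h + b2)
    # do/dh = o(1 - o) * w2 ; dh/dx = diag(h(1 - h)) @ W1
    return o * (1.0 - o) * ((w2 * h * (1.0 - h)) @ W1)

# The derivative with respect to x2 is small far from the transition between
# the low and high output regions and large close to it.
x_benign = np.array([1.0, 0.0])
x_near = np.array([1.0, 0.5])
print(mlp(x_benign), forward_derivative_mlp(x_benign)[1])
print(mlp(x_near), forward_derivative_mlp(x_near)[1])
```

Under these assumed weights, a large value of the $x_2$ component signals that a small $\delta x_2$ changes the output substantially, which is the kind of information the forward derivative is meant to expose.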