Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, Ananthram Swami

TL;DR

This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial perturbations by introducing defensive distillation, a training procedure that uses soft-target probability distributions, produced with a high softmax temperature, to train smoother, more robust models. Analytically and empirically, the authors show that defensive distillation sharply reduces the success of adversarial sample crafting (from 95% to less than 0.5% on one studied DNN) while preserving accuracy. They attribute this to the gradients used in crafting shrinking by a factor of 10^30, and they report that the average minimum number of features that must be modified rises by about 800% on one of the tested DNNs. The approach uses the same architecture for the original and distilled models, trains the distilled model on soft labels, and reverts to standard predictions at test time, which keeps the defense practical. Overall, defensive distillation improves DNN resilience to adversarial attacks with little overhead and modest parameter tuning, offering a foundation for security-sensitive deployments.
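To make the role of the softmax temperature concrete, the following minimal sketch computes a softmax at temperature T, i.e. exp(z_i / T) normalized over all classes. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not values from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-class problem (assumed for illustration).
logits = [8.0, 2.0, -1.0]

hard = softmax_with_temperature(logits, temperature=1.0)   # peaked distribution
soft = softmax_with_temperature(logits, temperature=20.0)  # flatter "soft" targets

print("T=1 :", [round(p, 4) for p in hard])
print("T=20:", [round(p, 4) for p in soft])
```

Raising the temperature flattens the distribution without changing which class ranks highest, which is why predictions at test time can fall back to the usual temperature of 1.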

Abstract

Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.

Paper Structure

This paper contains 20 sections, 1 theorem, 15 equations, 10 figures, 3 tables.

Key Result

Theorem 1

If there is a learning rule $A$ that is both an asymptotic empirical risk minimizer and stable, then $A$ generalizes, which means that the generalization error $L_{\cal D}(A(S))$ converges to $L_{\cal D}^* = \min_{h \in {\cal H}}L_{\cal D}(h)$ with some rate $\varepsilon(n)$ independent of any data

Figures (10)

  • Figure 1: Overview of a DNN architecture: This architecture, suitable for classification tasks thanks to its softmax output layer, is used throughout the paper along with its notations.
  • Figure 2: Set of legitimate and adversarial samples for two datasets: For each dataset, the top row shows legitimate samples, which are correctly classified by DNNs, and the bottom row shows the corresponding adversarial samples (crafted using [NAS-186]), which are misclassified by DNNs.
  • Figure 3: Adversarial crafting framework: Existing algorithms for adversarial sample crafting [NAS-186, goodfellow2014explaining] are a succession of two steps: (1) direction sensitivity estimation and (2) perturbation selection. Step (1) evaluates the sensitivity of model $F$ at the input point corresponding to sample $X$. Step (2) uses this knowledge to select a perturbation affecting sample $X$'s classification. If the resulting sample $X+\delta X$ is misclassified by model $F$ in the adversarial target class (here 4) instead of the original class (here 1), an adversarial sample $X^*$ has been found. If not, the steps can be repeated on updated input $X\leftarrow X+\delta X$.
  • Figure 4: Visualizing the hardness metric: This 2D representation illustrates the hardness metric as the radius of the disc centered at the original sample $X$ and going through the closest adversarial sample $X^*$ among all the possible adversarial samples crafted from sample $X$. Inside the disc, the class output by the classifier is constant. However, outside the disc, all samples $X^*$ are classified differently than $X$.
  • Figure 5: An overview of our defense mechanism based on a transfer of knowledge contained in probability vectors through distillation: We first train an initial network $F$ on data $X$ with a softmax temperature of $T$. We then use the probability vector $F(X)$, which includes additional knowledge about classes compared to a class label, predicted by network $F$ to train a distilled network $F^d$ at temperature $T$ on the same data $X$.
  • ...and 5 more figures
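
The procedure described in the Figure 5 caption (train an initial network at temperature $T$, use its probability vectors as soft labels, train a distilled network at the same temperature, then predict at temperature 1) can be sketched on a toy problem. This is a minimal sketch under stated assumptions: the single-layer softmax classifier, the synthetic two-cluster data, the learning rate, and all function names are hypothetical stand-ins, not the paper's DNN architectures or datasets.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(weights, x):
    # One linear layer (assumed toy model); the last entry of each row is the bias.
    return [sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in weights]


def train(data, targets, temperature, n_classes, epochs=200, lr=0.5, seed=0):
    """SGD on cross-entropy between targets and softmax(logits / temperature)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim + 1)]
               for _ in range(n_classes)]
    for _ in range(epochs):
        for x, t in zip(data, targets):
            p = softmax(logits_of(weights, x), temperature)
            for k in range(n_classes):
                # d(loss)/d(logit_k) = (p_k - t_k) / T when the targets sum to 1.
                g = (p[k] - t[k]) / temperature
                for j in range(dim):
                    weights[k][j] -= lr * g * x[j]
                weights[k][dim] -= lr * g
    return weights


def predict(weights, x):
    # At test time the temperature is set back to 1.
    probs = softmax(logits_of(weights, x), temperature=1.0)
    return probs.index(max(probs))


# Synthetic two-class data (assumed): two Gaussian clusters in 2D.
rng = random.Random(1)
data, hard_labels = [], []
for _ in range(40):
    data.append([rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)])
    hard_labels.append([1.0, 0.0])
    data.append([rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)])
    hard_labels.append([0.0, 1.0])

T = 20.0  # distillation temperature (illustrative value)

# Step 1: train the initial model F on hard labels at temperature T.
initial = train(data, hard_labels, T, n_classes=2)

# Step 2: label the same data with F's probability vectors at temperature T.
soft_labels = [softmax(logits_of(initial, x), T) for x in data]

# Step 3: train the distilled model F^d, same architecture, on the soft labels at T.
distilled = train(data, soft_labels, T, n_classes=2)

correct = sum(predict(distilled, x) == t.index(max(t))
              for x, t in zip(data, hard_labels))
accuracy = correct / len(data)
print("distilled accuracy on the toy data:", accuracy)
```

The sketch only illustrates the flow of information between the two training runs; it does not measure robustness to adversarial samples.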

Theorems & Definitions (3)

  • Definition 1: Asymptotic Empirical Risk Minimizer
  • Definition 2: Stability
  • Theorem 1