
Adversarial Examples Detection with Bayesian Neural Network

Yao Li, Tongyi Tang, Cho-Jui Hsieh, Thomas C. M. Lee

TL;DR

This work tackles adversarial example detection by introducing BATer, a detector that exploits the stochasticity of Bayesian Neural Networks to generate distributional representations of hidden-layer outputs. By measuring the dispersion between a test input's hidden-layer distributions and class-conditional references across multiple layers using the Wasserstein-1 ($W_1$) distance, and combining these scores with a logistic classifier, BATer achieves strong detection performance while remaining practical through layer selection and PCA-based dimensionality reduction. The method achieves superior or competitive results against several state-of-the-art detectors on MNIST, CIFAR-10, and Imagenet-Sub, and shows robustness to transfer and adaptive attacks. The approach highlights the value of incorporating randomness into detectors to reveal distributional differences between natural and adversarial data, potentially guiding future trustworthy ML defenses.
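The per-layer dispersion score described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes each layer's hidden outputs are available as sample matrices (one row per stochastic forward pass, or per reference training sample of the predicted class), fits PCA on the class-conditional reference samples, and, as an assumed simplification, sums one-dimensional Wasserstein-1 distances over the retained principal components. The function names (`wasserstein_1d`, `pca_basis`, `layer_dispersion`), the choice of `k`, and the toy data are hypothetical.

```python
import numpy as np


def wasserstein_1d(u, v):
    """W_1 distance between two 1-D empirical distributions with equal point weights."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.sort(np.concatenate([u, v]))
    widths = np.diff(grid)
    # Empirical CDFs of u and v on each interval [grid[i], grid[i + 1]).
    u_cdf = np.searchsorted(u, grid[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, grid[:-1], side="right") / v.size
    # In 1-D, W_1 equals the area between the two CDFs.
    return float(np.sum(np.abs(u_cdf - v_cdf) * widths))


def pca_basis(reference, k):
    """Mean and top-k principal directions (as rows) of the reference samples."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:k]


def layer_dispersion(test_samples, reference_samples, k=2):
    """Hypothetical per-layer score: summed per-component W_1 after PCA projection."""
    mean, basis = pca_basis(reference_samples, k)
    test_proj = (test_samples - mean) @ basis.T
    ref_proj = (reference_samples - mean) @ basis.T
    return sum(wasserstein_1d(test_proj[:, i], ref_proj[:, i])
               for i in range(basis.shape[0]))


# Synthetic toy data (not from the paper): 16-D hidden outputs whose
# first coordinate carries most of the variance.
rng = np.random.default_rng(0)
scale = np.ones(16)
scale[0] = 5.0
reference = rng.normal(size=(200, 16)) * scale   # class-conditional reference samples
natural = rng.normal(size=(50, 16)) * scale      # samples for a natural-looking input
shift = np.zeros(16)
shift[0] = 10.0
adversarial = natural + shift                    # same samples, shifted along the main axis

print(layer_dispersion(natural, reference))
print(layer_dispersion(adversarial, reference))  # noticeably larger than the line above
```

In this toy setting the shifted samples sit far from the reference along the dominant principal direction, so their score is larger; whether a real adversarial input behaves this way is exactly what the paper evaluates empirically.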

Abstract

In this paper, we propose a new framework to detect adversarial examples, motivated by the observations that random components can improve the smoothness of predictors and make it easier to simulate the output distribution of a deep neural network. Building on these observations, we propose a novel Bayesian adversarial example detector, BATer for short, to improve the performance of adversarial example detection. Specifically, we study the distributional difference in hidden-layer outputs between natural and adversarial examples, and propose to use the randomness of a Bayesian neural network to simulate the hidden-layer output distribution and to leverage distribution dispersion to detect adversarial examples. The advantage of a Bayesian neural network is that its output is stochastic, whereas a deep neural network without random components produces a deterministic output. Empirical results on several benchmark datasets against popular attacks show that the proposed BATer outperforms state-of-the-art detectors in adversarial example detection.
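To illustrate why stochastic weights yield a distribution of hidden-layer outputs while a deterministic network does not, here is a minimal sketch. It assumes a single hypothetical fully connected layer with a `tanh` activation and mean-field Gaussian weights, drawn elementwise as W ~ N(w_mean, w_std^2); BATer's actual BNN architectures and learned weight distributions are not reproduced here. Setting `w_std = 0` recovers a deterministic DNN layer.

```python
import numpy as np


def sample_hidden_outputs(x, w_mean, w_std, n_samples, rng):
    """Run n_samples stochastic forward passes of a hypothetical layer tanh(W x),
    drawing fresh Gaussian weights W ~ N(w_mean, w_std**2) on every pass."""
    outputs = []
    for _ in range(n_samples):
        w = w_mean + w_std * rng.standard_normal(w_mean.shape)
        outputs.append(np.tanh(w @ x))
    return np.stack(outputs)  # shape: (n_samples, hidden_dim)


rng = np.random.default_rng(0)
x = rng.standard_normal(8)              # one input vector (toy)
w_mean = rng.standard_normal((4, 8))    # weight means (toy values)

bnn_samples = sample_hidden_outputs(x, w_mean, w_std=0.1, n_samples=100, rng=rng)
dnn_samples = sample_hidden_outputs(x, w_mean, w_std=0.0, n_samples=100, rng=rng)

print(bnn_samples.shape)          # (100, 4)
print(bnn_samples.std(axis=0))    # nonzero spread: an empirical output distribution
print(np.all(dnn_samples == dnn_samples[0]))  # True: every deterministic pass is identical
```

The rows of `bnn_samples` form an empirical hidden-layer output distribution for a single input, which is the kind of object a distance-based detector can compare against reference distributions; the deterministic layer only ever yields one point.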

Paper Structure

This paper contains 32 sections, 1 theorem, 13 equations, 6 figures, 9 tables, and 1 algorithm.

Key Result

Proposition 1

Let $f({\boldsymbol x}, {\boldsymbol w})$ be a model with ${\boldsymbol x}\sim \boldsymbol{D}_{\boldsymbol x}$ and ${\boldsymbol w} \sim \boldsymbol{D}_{\boldsymbol w}$, where $\boldsymbol{D}_{\boldsymbol w}$ is any distribution under which ${\boldsymbol w}$ is symmetric about ${\boldsymbol w}_0$ ... where ${\boldsymbol \delta}$ represents the adversarial perturbation and $\mathcal{D}$ represents a tra...

Figures (6)

  • Figure 1: Detection framework of BATer. The diagram walks through an example of how BATer works. An adversarial image (${\boldsymbol x}$) of the handwritten digit 7 is misclassified as 0 by the classifier (a BNN). To check whether the input is adversarial, the image is fed into the BNN multiple times to obtain the hidden-layer output distributions ($h_j({\boldsymbol x})$) of the selected layers (layers 3, 5, and 6 in this example). Details of the $h_j({\boldsymbol x})$ computation and layer selection are given in the Method section. Next, distances ($d_j$) are computed between the hidden-layer output distributions $h_j({\boldsymbol x})$ and the hidden-layer output distributions of training samples from the predicted class ($h_j^c$); in this example the reference is $h_j^0$ because the model predicts the input as class 0. Finally, the distances are fed into the detector, which performs binary classification: adversarial vs. natural. Details of distance computation and detector training can be found in the Method section.
  • Figure 2: Illustration of a Bayesian neural network. All weights in a BNN are represented by probability distributions over possible values rather than by a single fixed value. The red curves in the graph represent these distributions. We view a BNN as a probabilistic model: given an input ${\boldsymbol x}$, a BNN assigns a probability to each possible output $y$, using a set of parameters ${\boldsymbol w}$ sampled from the learned distributions.
  • Figure 3: Hidden-layer output distributions (HLDs) of VGG16 and of a BNN (VGG16-based architecture), computed on images from the automobile class of CIFAR10. Legend: train denotes HLDs of training samples from the automobile class; test denotes HLDs of test samples from the automobile class; adv denotes HLDs of adversarial examples predicted as automobiles. The adversarial examples are generated by PGD (Madry et al., 2017). The three plots in the first row show hidden-layer distributions of the BNN, and the plots in the second row show those of a DNN with the same base architecture. For both the DNN and the BNN, natural (train and test) and adversarial (adv) hidden outputs differ in distribution, but the differences are larger for the BNN.
  • Figure 4: ROC curves of the comparison with state-of-the-art detectors on MNIST and CIFAR10. The curves show that BATer outperforms the other detection methods or performs comparably to the best method in all cases.
  • Figure 5: AUC histograms of BATer with different base structures (BNN vs. DNN) on Imagenet-sub. The BNN-based variant achieves higher AUCs.
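The last stage of Figure 1, where the per-layer distances $d_j$ are fed into a binary detector, can be sketched as below; the TL;DR states that this detector is a logistic classifier. This is a minimal sketch under stated assumptions: the distance features are synthetic placeholders (one column per selected layer, mirroring the three layers in the Figure 1 example), labels use 1 for adversarial and 0 for natural, and the model is fit with plain Newton (IRLS) iterations in NumPy rather than whatever solver the authors used. All helper names are hypothetical.

```python
import numpy as np


def add_bias(features):
    """Append a constant column so the logistic model has an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_logistic(features, labels, n_iter=25, ridge=1e-6):
    """Logistic regression fitted by Newton's method (IRLS) with a tiny ridge term."""
    X = add_bias(features)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - labels)
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + ridge * np.eye(X.shape[1])
        w -= np.linalg.solve(hess, grad)
    return w


def predict_adversarial(w, features, threshold=0.5):
    """Return 1 (adversarial) when the predicted probability exceeds the threshold."""
    p = 1.0 / (1.0 + np.exp(-add_bias(features) @ w))
    return (p > threshold).astype(int)


def toy_distances(rng, n):
    """Synthetic distances for 3 hypothetical layers: adversarial inputs sit farther away."""
    natural = rng.gamma(2.0, 0.5, size=(n, 3))
    adversarial = rng.gamma(2.0, 0.5, size=(n, 3)) + 1.5
    features = np.vstack([natural, adversarial])
    labels = np.concatenate([np.zeros(n), np.ones(n)])
    return features, labels


rng = np.random.default_rng(0)
train_x, train_y = toy_distances(rng, 200)
test_x, test_y = toy_distances(rng, 200)

w = fit_logistic(train_x, train_y)
accuracy = np.mean(predict_adversarial(w, test_x) == test_y)
print(f"held-out accuracy on synthetic distances: {accuracy:.3f}")
```

A linear model over per-layer distances keeps the detector small, and its coefficients indicate how strongly each layer's distance contributes in this toy fit; the accuracy printed here reflects only the synthetic data, not any result from the paper.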

Theorems & Definitions (1)

  • Proposition 1