On the Robustness of Bayesian Neural Networks to Adversarial Attacks

Luca Bortolussi; Ginevra Carbone; Luca Laurenti; Andrea Patane; Guido Sanguinetti; Matthew Wicker

TL;DR

This paper analyzes the robustness of Bayesian Neural Networks (BNNs) to adversarial attacks by examining the geometry of the data and the infinite-width Gaussian process (GP) limit. It shows that, in this limit, vulnerability to gradient-based attacks stems from loss-gradient components orthogonal to the data manifold, and that averaging over the posterior weight distribution cancels these orthogonal gradients, yielding provable robustness. The authors extend the results to classification and corroborate them with experiments on MNIST, Fashion MNIST, and the synthetic half moons dataset, showing that BNNs can maintain high accuracy while resisting gradient-based and gradient-free attacks, especially when trained with accurate Bayesian inference such as Hamiltonian Monte Carlo (HMC). They also discuss limitations, including the reliance on the infinite-width assumption and the resulting gap between theory and finite-width practice, and highlight the practical potential of robustness through posterior averaging.

Abstract

Vulnerability to adversarial attacks is one of the principal hurdles to the adoption of deep learning in safety-critical applications. Despite significant efforts, both practical and theoretical, training deep learning models that are robust to adversarial attacks is still an open problem. In this paper, we analyse the geometry of adversarial attacks in the large-data, overparameterized limit for Bayesian Neural Networks (BNNs). We show that, in the limit, vulnerability to gradient-based attacks arises as a result of degeneracy in the data distribution, i.e., when the data lies on a lower-dimensional submanifold of the ambient space. As a direct consequence, we demonstrate that in this limit BNN posteriors are robust to gradient-based adversarial attacks. Crucially, we prove that the expected gradient of the loss with respect to the BNN posterior distribution vanishes, even when each neural network sampled from the posterior is vulnerable to gradient-based attacks. Experimental results on the MNIST, Fashion MNIST, and half moons datasets, representing the finite data regime, with BNNs trained with Hamiltonian Monte Carlo and Variational Inference, support this line of argument, showing that BNNs can display both high accuracy on clean data and robustness to both gradient-based and gradient-free adversarial attacks.
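
The central quantity in the abstract, the loss gradient with respect to the input averaged over posterior weight samples, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the one-hidden-layer tanh network, the squared loss, and the helper names (`loss_input_gradient`, `expected_input_gradient`, `fgsm`, `sample_weights`) are hypothetical, and the "posterior samples" are random Gaussian draws standing in for the output of HMC or VI, not an actual posterior.

```python
import numpy as np

def loss_input_gradient(x, y, w1, b1, w2, b2):
    """Gradient w.r.t. x of 0.5 * (f(x) - y)^2 for the assumed network
    f(x) = w2 . tanh(w1 @ x + b1) + b2."""
    h = np.tanh(w1 @ x + b1)              # hidden activations, shape (n_hidden,)
    f = w2 @ h + b2                       # scalar network output
    df_dx = w1.T @ (w2 * (1.0 - h ** 2))  # chain rule through tanh
    return (f - y) * df_dx

def expected_input_gradient(x, y, weight_samples):
    """Monte Carlo estimate of the expected loss gradient over weight samples."""
    grads = [loss_input_gradient(x, y, *w) for w in weight_samples]
    return np.mean(grads, axis=0)

def fgsm(x, grad, eps):
    """One FGSM step: move each input coordinate by eps along the gradient sign."""
    return x + eps * np.sign(grad)

def sample_weights(rng, d, n_hidden):
    """Stand-in for a posterior sample (assumption: plain Gaussian draws)."""
    return (rng.normal(0.0, 1.0 / np.sqrt(d), (n_hidden, d)),
            rng.normal(0.0, 1.0, n_hidden),
            rng.normal(0.0, 1.0 / np.sqrt(n_hidden), n_hidden),
            rng.normal(0.0, 1.0))

rng = np.random.default_rng(0)
d, n_hidden = 4, 16
x, y = rng.normal(size=d), 0.5
samples = [sample_weights(rng, d, n_hidden) for _ in range(200)]

g_single = loss_input_gradient(x, y, *samples[0])  # what an attack on one network uses
g_mean = expected_input_gradient(x, y, samples)    # what an attack on the BNN uses
x_adv = fgsm(x, g_mean, eps=0.1)
```

Comparing `np.linalg.norm(g_single)` with `np.linalg.norm(g_mean)` for a growing number of samples mirrors the kind of comparison the paper reports for trained BNNs; this toy setup makes no claim about how those norms behave for a real posterior.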

Paper Structure (19 sections, 8 theorems, 29 equations, 7 figures, 7 tables)

Introduction
Related Work
Background
Infinitely-Wide Neural Networks
Bayesian Neural Networks
Adversarial Attacks for Bayesian Neural Networks
Gradient-Based Adversarial Attacks for Neural Networks
A Symmetry Property of Neural Networks
Adversarial Robustness via Bayesian Averaging
Extension to Classification Setting
Consequences and Limitations of our Results
Empirical Results
Analysis of the Convergence of BNN gradients
Evaluation of the Gradient of the Loss for BNNs on Image Classification Tasks
Gradient-Based Attacks for BNNs
...and 4 more sections

Key Result

Proposition 1

Consider a neural network $f(\mathbf{x},\mathbf{w})$ with a single hidden layer. Assume that independent normal priors are associated with each weight and bias, such that $w^{(1)}_{ij} \sim \mathcal{N}(0,\frac{\sigma_w^2}{d})$, $w^{(2)}_{ij} \sim \mathcal{N}(0,\frac{\sigma_w^2}{n_1})$, and $b^{(1)}_i,b^{(2)}_i \sim \mathcal{N}(0,\sigma_b^2)$. Then, for $n_1\to\infty$, the prior o
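
The prior scalings in Proposition 1 can be explored numerically with a short sketch. Since the network definition is not reproduced here, the architecture is an assumption: a scalar-output network $f(\mathbf{x}) = \sum_j w^{(2)}_j \, \phi(\sum_i w^{(1)}_{ji} x_i + b^{(1)}_j) + b^{(2)}$ with ReLU $\phi$. The function name `sample_prior_outputs` and its default parameters are hypothetical.

```python
import numpy as np

def sample_prior_outputs(x, n1, n_draws, sigma_w=1.0, sigma_b=1.0, seed=0):
    """Draw n_draws networks from the prior and return their outputs at x.

    Assumed architecture (not taken from the source): one hidden layer of
    width n1 with ReLU activations and a scalar output.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # First-layer weights have variance sigma_w^2 / d.
    w1 = rng.normal(0.0, sigma_w / np.sqrt(d), size=(n_draws, n1, d))
    b1 = rng.normal(0.0, sigma_b, size=(n_draws, n1))
    # Second-layer weights have variance sigma_w^2 / n1.
    w2 = rng.normal(0.0, sigma_w / np.sqrt(n1), size=(n_draws, n1))
    b2 = rng.normal(0.0, sigma_b, size=n_draws)
    hidden = np.maximum(w1 @ x + b1, 0.0)       # shape (n_draws, n1)
    return np.sum(w2 * hidden, axis=1) + b2     # shape (n_draws,)

x = np.array([1.0, 1.0])
variances = {n1: sample_prior_outputs(x, n1, n_draws=4000).var()
             for n1 in (16, 256)}
```

Under these assumed choices, the $1/n_1$ scaling of the second-layer variance keeps the prior variance of the output independent of the width, so the entries of `variances` should be close to each other; what changes as $n_1$ grows is the shape of the output distribution, which is the subject of the proposition.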

Figures (7)

Figure 1: We consider a regression problem with data manifold given by the line $x_1=x_2$ and data generated by the function $\frac{2x_1^2}{10}-x_1$. For this problem we train a BNN with HMC and a deterministic NN (DNN) with SGD, both with the same architecture: ReLU activation functions and one hidden layer with 512 neurons. Furthermore, we also train a GP with kernel equal to that of an infinitely-wide BNN with ReLU activation functions and one hidden layer. All learning models achieve accuracy $>99\%$. We plot the mean and variance of the scalar projection of the gradient of the GP in a direction orthogonal to the data manifold for all points in the ambient space and compare it to the mean of the same quantity for the BNN and DNN. The plane $z=0$ is plotted in red.
Figure 2: We plot the scalar projection of the orthogonal gradient of the GP limit of neural networks with ReLU activation functions and one hidden layer for $\sigma=0.1$ and $\sigma=0.4$ for the settings of Figure \ref{fig:provaBNNCOnvergenza}, where $\sigma$ is the standard deviation of the likelihood. In both cases the orthogonal derivative is identically $0$ on the data manifold. However, outside of the data manifold $\sigma$ has a large effect on the orthogonal derivative.
Figure 3: We plot the expected loss gradients with respect to the input (and report their $\ell_\infty$-norm under "Norm" in the title of each plot) for two BNNs trained on MNIST (top rows) and Fashion MNIST (bottom rows) for some example images and for different numbers of samples from the posterior predictive distribution. For training the BNN on MNIST we employ HMC (topmost row next to each image) and VI (bottommost row next to each image). To the right of the images, we plot a heat map of gradient values. In all cases we observe that the expected loss gradients decrease as the number of samples increases.
Figure 4: We plot the $\ell_2$ norm of the input gradient as we increase the number of samples. In the top row (in blue) we plot the trend of the input gradient for a BNN trained with VI on MNIST (left) and Fashion MNIST (right). In the bottom row (in red) we plot the trend of the input gradient for an HMC-trained BNN on MNIST (left) and Fashion MNIST (right). In all cases, we observe the expected trend: the norm of the gradient decreases as we increase the number of samples.
Figure 5: Robustness-accuracy trade-off on MNIST (first row) and Fashion MNIST (second row) for BNNs trained with HMC (a), VI (b) and SGD (blue dots), where the softmax difference is computed according to Eqn \ref{Eqn:softmaxDifference} and denotes the average maximal difference in softmax value for the specific neural network over an $\epsilon$-ball of input points computed via FGSM attack. While a trade-off between accuracy and robustness occurs for deterministic NNs, experiments on HMC show a positive correlation between accuracy and robustness. The boxplots show the correlation between model capacity and robustness. Different attack strengths ($\epsilon$) are used for the three methods according to their average robustness.
...and 2 more figures

Theorems & Definitions (17)

Definition 1: Infinitely-wide neural network
Proposition 1: neal2012bayesian, matthews2018gaussian
Lemma 1
proof
Lemma 2: anders2020fairwashing
Proposition 2
proof
Theorem 1
proof
Corollary 1
...and 7 more
