Attacking Bayes: On the Adversarial Robustness of Bayesian Neural Networks

Yunzhen Feng, Tim G. J. Rudner, Nikolaos Tsilivis, Julia Kempe

TL;DR

This work tests the recent claim that Bayesian neural networks (BNNs) are inherently robust to adversarial perturbations. The authors evaluate state-of-the-art inference methods, including Hamiltonian Monte Carlo (HMC) and modern variational inference approaches, on three tasks: label prediction under the posterior predictive mean, adversarial example detection, and semantic shift detection. Even relatively simple attacks severely degrade both accuracy and uncertainty-based detection. The authors also identify and correct conceptual and experimental errors in prior studies that claimed inherent robustness, and give recommendations for rigorous robustness evaluation. The findings imply that uncertainty-aware Bayesian prediction pipelines are not inherently robust, so adversarially trained or otherwise defense-aware Bayesian methods are needed for robust performance in practice.

Abstract

Adversarial examples have been shown to cause neural networks to fail on a wide range of vision and language tasks, but recent work has claimed that Bayesian neural networks (BNNs) are inherently robust to adversarial perturbations. In this work, we examine this claim. To study the adversarial robustness of BNNs, we investigate whether it is possible to successfully break state-of-the-art BNN inference methods and prediction pipelines using even relatively unsophisticated attacks for three tasks: (1) label prediction under the posterior predictive mean, (2) adversarial example detection with Bayesian predictive uncertainty, and (3) semantic shift detection. We find that BNNs trained with state-of-the-art approximate inference methods, and even BNNs trained with Hamiltonian Monte Carlo, are highly susceptible to adversarial attacks. We also identify various conceptual and experimental errors in previous works that claimed inherent adversarial robustness of BNNs and conclusively demonstrate that BNNs and uncertainty-aware Bayesian prediction pipelines are not inherently robust against adversarial attacks.
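
To make task (1) concrete, the following is a minimal sketch of an L-infinity PGD attack on a posterior predictive mean. It is not the authors' code or setup: the "BNN" is a toy binary logistic model, the three weight vectors stand in for posterior samples, and the values of `eps`, `step`, and `n_steps` are arbitrary assumptions chosen for illustration. The point it illustrates is that the gradient is taken through the average over samples, not through a single sampled network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def predictive_mean(x, weight_samples):
    """Posterior predictive mean p(y=1 | x), averaged over weight samples."""
    return sum(sigmoid(dot(w, x)) for w in weight_samples) / len(weight_samples)

def loss_gradient(x, y, weight_samples):
    """Gradient w.r.t. x of -log p_mean(y | x) for a binary label y in {0, 1}."""
    p1 = predictive_mean(x, weight_samples)
    p_y = p1 if y == 1 else 1.0 - p1
    direction = 1.0 if y == 1 else -1.0
    grad = [0.0] * len(x)
    for w in weight_samples:
        s = sigmoid(dot(w, x))
        # Chain rule through the average of per-sample probabilities.
        coeff = -direction * s * (1.0 - s) / (len(weight_samples) * p_y)
        for i, w_i in enumerate(w):
            grad[i] += coeff * w_i
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(x0, y, weight_samples, eps, step, n_steps):
    """Signed-gradient ascent on the loss, projected onto the eps-ball around x0."""
    x = list(x0)
    for _ in range(n_steps):
        grad = loss_gradient(x, y, weight_samples)
        x = [x_i + step * sign(g_i) for x_i, g_i in zip(x, grad)]
        x = [min(max(x_i, c_i - eps), c_i + eps) for x_i, c_i in zip(x, x0)]
    return x

# Hypothetical "posterior samples" and input, chosen by hand for illustration.
weight_samples = [[2.0, -1.0], [1.5, -0.5], [2.5, -1.5]]
x_clean, y_true = [0.4, 0.1], 1
x_adv = pgd_attack(x_clean, y_true, weight_samples, eps=0.5, step=0.1, n_steps=20)
print(predictive_mean(x_clean, weight_samples), predictive_mean(x_adv, weight_samples))
```

In this toy case the averaged prediction flips from class 1 to class 0 while the perturbation stays inside the `eps` ball. For a real BNN the same loop would use automatic differentiation through the Monte Carlo average of the sampled networks.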

Paper Structure (29 sections, 8 equations, 19 figures, 9 tables)

Figures (19)

  • Figure 1: Left (Label Prediction): Accuracy and robust accuracy on test and adversarial inputs for CNNs trained on MNIST. Center & Right (Adversarial Example and Semantic Shift Detection): Average selective prediction accuracy (ASA) for adversarial examples and semantically shifted inputs on MNIST (with FashionMNIST as the semantically shifted data). Note that ASA has a lower bound of $12.5\%$. The BNN inference methods used are HMC (the "gold standard"), FSVI (the state of the art for approximate inference and uncertainty quantification in BNNs), and PSVI and MCD (well-established approximate inference methods). Simple PGD attacks break all methods in all prediction pipelines. For further details, see the results section.
  • Figure 2: Selective accuracy for adversarial example detection in Smith and Gal (2018). Total and Epistemic refer to the uncertainty measure used for thresholding. A decrease in accuracy as the rejection rate increases indicates that the model rejects more clean than adversarial samples as the rejection threshold decreases. Even for the weakest attack, which targets model accuracy alone, the essentially flat curve demonstrates that detection is no better than random. There is no advantage in using epistemic uncertainty rather than total uncertainty.
  • Figure 3: Adversarial example detection statistics for all four methods on MNIST with a four-layer CNN architecture. Higher curves correspond to better adversarial example detection. The adversarial attacks are able to significantly deteriorate OOD detection in all settings and for all methods.
  • Figure 4: Semantic shift detection statistics for HMC, MCD, PSVI, and FSVI on MNIST with a CNN. Higher curves correspond to better OOD detection. The adversarial attacks are able to significantly deteriorate OOD detection in all settings and for all methods.
  • Figure 5: The first highlighted line shows that the batch normalization layers are set to training mode (True), which is appropriate for training but not for evaluation. The second highlighted line shows the softmax operation in the model. Code is copied from https://github.com/lsgos/uncertainty-adversarial-paper/blob/master/cats_and_dogs.py (commit dbc7ec5).
  • ...and 14 more figures
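
The captions above refer to total and epistemic uncertainty and to selective accuracy at a given rejection rate. The sketch below shows one common way to compute these quantities from Monte Carlo samples of the predictive distribution. It is an assumption that these conventions match the paper's: total uncertainty is taken as the entropy of the mean prediction, epistemic uncertainty as the mutual information (total minus expected per-sample entropy), and selective accuracy as accuracy on the inputs kept after rejecting the most uncertain fraction. The paper's exact ASA definition is not reproduced here, and the function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def uncertainties(sample_probs):
    """Return (total, epistemic) uncertainty for one input.

    sample_probs holds one class-probability vector per posterior sample.
    """
    n = len(sample_probs)
    k = len(sample_probs[0])
    mean = [sum(p[c] for p in sample_probs) / n for c in range(k)]
    total = entropy(mean)                                # entropy of the mean
    expected = sum(entropy(p) for p in sample_probs) / n  # mean of entropies
    return total, total - expected                        # mutual information

def selective_accuracy(correct, uncertainty, rejection_rate):
    """Accuracy on the inputs kept after rejecting the most uncertain ones."""
    n = len(correct)
    n_keep = max(1, n - int(rejection_rate * n))
    order = sorted(range(n), key=lambda i: uncertainty[i])  # most certain first
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep

# Samples that agree carry no epistemic uncertainty; samples that disagree do.
print(uncertainties([[0.9, 0.1], [0.9, 0.1]]))
print(uncertainties([[0.9, 0.1], [0.1, 0.9]]))

# If wrong predictions are the uncertain ones, rejection raises accuracy.
print(selective_accuracy([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9], 0.5))
# If an attack makes wrong predictions look confident, rejection lowers it.
print(selective_accuracy([1, 1, 0, 0], [0.9, 0.8, 0.1, 0.2], 0.5))
```

The last two calls mirror the reading of Figure 2: a selective accuracy curve that falls as the rejection rate grows means the uncertainty score is rejecting correctly classified inputs before misclassified ones.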