Transferable Adversarial Examples with Bayes Approach
Mingyuan Fan, Cen Chen, Wenmeng Zhou, Yinggui Wang
TL;DR
The paper tackles the problem of black-box transferability of adversarial examples by introducing BayAtk, a Bayesian framework that uses transferability-promoting priors to encourage disruption of cross-model features. It defines pixel-level removal and region-based soft removal priors and combines them with an adaptive dynamic weighting strategy to generate highly transferable adversarial inputs. Extensive experiments on ImageNet and real-world systems (Google MLaaS and Claude3) show BayAtk outperforms state-of-the-art transfer attacks, including under defense scenarios, and demonstrates practical efficiency. The results illuminate a principled way to study transferability through priors and have implications for both attacking and defending DNN-based systems in security-critical applications.
Abstract
The vulnerability of deep neural networks (DNNs) to black-box adversarial attacks is one of the most actively debated topics in trustworthy AI. In such attacks, the attacker operates without any inside knowledge of the target model, which makes the cross-model transferability of adversarial examples critical. Although adversarial examples can in principle be effective across various models, those crafted for one specific model often transfer poorly. In this paper, we explore the transferability of adversarial examples through the lens of a Bayesian approach. Specifically, we use the Bayesian approach to probe transferability and then study what constitutes a transferability-promoting prior. Following this, we design two concrete transferability-promoting priors, along with an adaptive dynamic weighting strategy for instances sampled from these priors. Combining these techniques, we present BayAtk. Extensive experiments show that BayAtk crafts more transferable adversarial examples than existing state-of-the-art attacks against both undefended and defended black-box models.
