Transferability Bound Theory: Exploring Relationship between Adversarial Transferability and Flatness
Mingyuan Fan, Xiaodan Li, Cen Chen, Wenmeng Zhou, Yaliang Li
TL;DR
This work questions the widely held view that flatter adversarial examples naturally transfer better: it derives a theoretical bound on transferability and shows that flatness alone is insufficient. It decouples transferability into a local effectiveness term and a transfer-related boundary term, and proves a bound involving inherent model differences, first-order gradients, and second-order gradient components. Building on this theory, the authors propose Theoretically Provable Attack (TPA), which optimizes a first-order gradient-based surrogate of the bound to craft more transferable adversarial examples, avoiding expensive higher-order gradients via a Hessian-free approximation. Empirically, TPA shows higher transferability than state-of-the-art baselines on ImageNet across many models, including transformer architectures, and in several real-world applications (Google Vision, search engines, GPT-4, Claude3), underscoring both the practical impact and the need for robust defenses. Overall, the paper provides a theoretical foundation and a practical, scalable attack that can recalibrate how researchers think about flatness and transferability in adversarial settings.
Abstract
A prevailing belief in the attack and defense community is that higher flatness of adversarial examples enables better cross-model transferability, leading to a growing interest in employing sharpness-aware minimization and its variants. However, the theoretical relationship between the transferability of adversarial examples and their flatness has not been well established, making the belief questionable. To bridge this gap, we embark on a theoretical investigation and, for the first time, derive a theoretical bound for the transferability of adversarial examples with few practical assumptions. Our analysis challenges this belief by demonstrating that increased flatness of adversarial examples does not necessarily guarantee improved transferability. Moreover, building upon the theoretical analysis, we propose TPA, a Theoretically Provable Attack that optimizes a surrogate of the derived bound to craft adversarial examples. Extensive experiments across widely used benchmark datasets and various real-world applications show that TPA can craft more transferable adversarial examples than state-of-the-art baselines. We hope that these results can recalibrate preconceived impressions within the community and facilitate the development of stronger adversarial attack and defense mechanisms. The source code is available at <https://github.com/fmy266/TPA>.
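
The TL;DR mentions avoiding higher-order gradients through a Hessian-free approximation. The summary does not spell out the exact approximation TPA uses, so the following is only a generic illustration of the idea, not the authors' method: a Hessian-vector product can be estimated from two first-order gradient evaluations by finite differences, without ever forming the Hessian. The toy quadratic objective, the function names, and the step size `eps` are all assumptions made for this sketch.

```python
import numpy as np


def grad_quadratic(x, A, b):
    """Gradient of the toy objective f(x) = 0.5 * x^T A x + b^T x (A symmetric)."""
    return A @ x + b


def hvp_finite_difference(grad_fn, x, v, eps=1e-3):
    """Estimate H(x) @ v from two gradient calls, never building H itself.

    Uses the forward difference (grad(x + eps * v) - grad(x)) / eps.
    """
    return (grad_fn(x + eps * v) - grad_fn(x)) / eps


# Hypothetical toy problem: a random symmetric matrix plays the role of the Hessian.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T  # symmetric, so A is the exact Hessian of the toy objective
b = rng.standard_normal(4)
x = rng.standard_normal(4)
v = rng.standard_normal(4)

approx = hvp_finite_difference(lambda z: grad_quadratic(z, A, b), x, v)
exact = A @ v  # available here only because the toy Hessian is known

print("max abs error:", float(np.max(np.abs(approx - exact))))
```

For a quadratic objective the finite difference matches the true product up to floating-point rounding. For a neural-network loss the estimate would carry an additional truncation error that depends on `eps`, which is the usual trade-off of this kind of approximation.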
