Seeking Flat Minima over Diverse Surrogates for Improved Adversarial Transferability: A Theoretical Framework and Algorithmic Instantiation
Meixi Zheng, Kehan Wu, Yanbo Fan, Rui Huang, Baoyuan Wu
TL;DR
This work establishes a rigorous transferability bound for transfer-based black-box adversarial attacks by decomposing target risk into surrogate risk and a transferability gap, with the gap bounded by an adversarial model discrepancy measured via $\phi$-divergences and a localized loss sharpness term. It introduces a general attack framework that unifies prior methods and informs algorithm design, culminating in DRAP, a Diversity-aware Reverse Adversarial Perturbation technique that promotes flat minima across a diverse surrogate set while minimizing surrogate–target discrepancy. The authors construct between- and within-distribution surrogate diversity to better approximate potential future targets and demonstrate that optimizing for flatness across this diverse surrogate ensemble yields substantial transferability gains on ImageNet and CIFAR-10, including against defended models and when combined with input transformations. The results highlight the practical value of jointly addressing surrogate performance, loss landscape flatness, and distributional shift to enhance adversarial transferability in real-world, black-box settings.
Abstract
The transfer-based black-box adversarial attack setting poses the challenge of crafting adversarial examples (AEs) on known surrogate models that remain effective against unseen target models. Due to the practical importance of this task, numerous methods have been proposed to address it. However, most previous methods are heuristically designed and intuitively justified, lacking a theoretical foundation. To bridge this gap, we derive a novel transferability bound that offers provable guarantees for adversarial transferability. Our theoretical analysis has two advantages: \textit{(i)} it deepens our understanding of previous methods by building a general attack framework, and \textit{(ii)} it provides guidance for designing an effective attack algorithm. Our theoretical results demonstrate that optimizing AEs toward flat minima over the surrogate model set, while controlling the surrogate-target model shift measured by the adversarial model discrepancy, yields a comprehensive guarantee for AE transferability. The results further lead to a general transfer-based attack framework, within which we observe that previous methods account for only some of the factors that contribute to transferability. Algorithmically, inspired by our theoretical results, we first carefully construct a surrogate model set whose models exhibit diverse adversarial vulnerabilities with respect to AEs, in order to narrow an instantiated adversarial model discrepancy. Then, a \textit{model-Diversity-compatible Reverse Adversarial Perturbation} (DRAP) is generated to effectively promote the flatness of AEs over diverse surrogate models and thereby improve transferability. Extensive experiments on the NIPS2017 and CIFAR-10 datasets against various target models demonstrate the effectiveness of our proposed attack.
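To make the idea of seeking flat minima over a surrogate set concrete, the following toy sketch runs a min-max attack on an ensemble of random linear softmax classifiers. It is not the authors' DRAP algorithm: the linear models stand in for neural surrogates, the diversity-aware construction of the surrogate set is omitted, and a single reverse perturbation is shared across all surrogates. Every name and hyperparameter (`mean_ce_and_grad`, `flat_ensemble_attack`, `eps`, `eps_n`, the step sizes) is an assumption made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mean_ce_and_grad(models, x, y):
    """Mean cross-entropy over the surrogates and its gradient w.r.t. input x."""
    loss, grad = 0.0, np.zeros_like(x)
    for W, b in models:
        p = softmax(W @ x + b)
        loss += -np.log(p[y] + 1e-12)
        p[y] -= 1.0                      # dCE/dlogits = p - onehot(y)
        grad += W.T @ p                  # chain rule through the linear layer
    return loss / len(models), grad / len(models)

def flat_ensemble_attack(models, x, y, eps=0.5, eps_n=0.1,
                         steps=50, inner_steps=5, lr=0.05, lr_n=0.05):
    """Untargeted L-inf attack that raises the ensemble loss at the
    worst-case neighbour of the current adversarial example."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        # Inner loop: reverse perturbation n (|n|_inf <= eps_n) that
        # *lowers* the loss, i.e. the least favourable nearby point.
        n = np.zeros_like(x)
        for _ in range(inner_steps):
            _, g = mean_ce_and_grad(models, x + delta + n, y)
            n = np.clip(n - lr_n * np.sign(g), -eps_n, eps_n)
        # Outer step: raise the loss at that worst-case point, which
        # pushes the adversarial example toward a flat region.
        _, g = mean_ce_and_grad(models, x + delta + n, y)
        delta = np.clip(delta + lr * np.sign(g), -eps, eps)
    return x + delta

# Hypothetical toy setup: three random linear surrogates, one random input.
rng = np.random.default_rng(0)
dim, n_classes = 20, 5
models = [(rng.normal(size=(n_classes, dim)), rng.normal(size=n_classes))
          for _ in range(3)]
x = rng.normal(size=dim)
y = int(np.argmax(sum(W @ x + b for W, b in models)))  # ensemble's clean label
x_adv = flat_ensemble_attack(models, x, y)
print("mean CE before:", mean_ce_and_grad(models, x, y)[0])
print("mean CE after: ", mean_ce_and_grad(models, x_adv, y)[0])
```

Setting `eps_n=0` or `inner_steps=0` reduces the sketch to a plain sign-gradient ensemble attack, which gives a simple baseline for comparing how the reverse perturbation changes the final adversarial example.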
