Scaling Laws for Black-box Adversarial Attacks
Chuan Liu, Huanran Chen, Yichi Zhang, Jun Zhu, Yinpeng Dong
TL;DR
This work reveals a robust log-linear scaling law for black-box adversarial attacks: when gradient conflicts are properly resolved, the attack success rate scales linearly with the logarithm of the ensemble size. The authors validate the law across standard classifiers, SOTA defenses, and multimodal LLMs, and extend it to vision encoders, demonstrating strong transfer attacks on models such as GPT-4o while exposing a robustness hierarchy in which Claude-3.5-Sonnet stands out as notably resilient. The study combines theoretical motivation with large-scale empirical analysis, showing that scaling distills robust, semantic features of the target class, and argues that robustness evaluation should shift toward large-scale threat modeling. Practically, this yields a strong benchmarking approach for SOTA models and shows that increasing ensemble size is a principled, powerful lever for evaluating and improving model robustness.
Abstract
Adversarial examples exhibit cross-model transferability, enabling threatening black-box attacks on commercial models. Model ensembling, which attacks multiple surrogate models simultaneously, is a known strategy for improving this transferability. However, prior studies typically use small, fixed ensembles, leaving open the question of whether scaling the number of surrogate models can further improve black-box attacks. In this work, we conduct the first large-scale empirical study of this question. By resolving gradient conflicts with advanced optimizers, we uncover, through both theoretical analysis and empirical evaluation, a robust and universal log-linear scaling law: the Attack Success Rate (ASR) scales linearly with the logarithm of the ensemble size $T$. We verify this law across standard classifiers, SOTA defenses, and MLLMs, and find that scaling distills robust, semantic features of the target class. We then apply this insight to benchmark SOTA MLLMs, which reveals both the attack's strength and a clear robustness hierarchy: we achieve a transfer attack success rate above 80\% on proprietary models such as GPT-4o, while Claude-3.5-Sonnet proves exceptionally resilient. Our findings urge a shift in focus for robustness evaluation: from designing intricate algorithms on small ensembles to understanding the principled and powerful threat of scaling.
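To make the stated relationship concrete, the log-linear law can be read as ASR ≈ a + b·ln(T) for some intercept a and slope b. The sketch below is only an illustration of how such a trend could be fitted by least squares. The function name `fit_log_linear`, the coefficients, and the data points are all hypothetical placeholders, not values or code from the paper.

```python
import numpy as np


def fit_log_linear(ensemble_sizes, asr):
    """Least-squares fit of asr ~ a + b * ln(T); returns (a, b)."""
    log_t = np.log(np.asarray(ensemble_sizes, dtype=float))
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    b, a = np.polyfit(log_t, np.asarray(asr, dtype=float), deg=1)
    return a, b


# Synthetic placeholder data (NOT results from the paper): points generated
# from an assumed law with a = 0.10 and b = 0.15.
sizes = [1, 2, 4, 8, 16, 32]
a_true, b_true = 0.10, 0.15
asr = [a_true + b_true * np.log(t) for t in sizes]

a_hat, b_hat = fit_log_linear(sizes, asr)
print(f"fitted intercept a = {a_hat:.3f}, slope b = {b_hat:.3f}")
```

Because ASR is a rate bounded by 1, a fit of this form can only describe the regime before saturation; measured points near the ceiling would need to be excluded or modeled separately.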
