Scaling Laws for Black box Adversarial Attacks

Chuan Liu, Huanran Chen, Yichi Zhang, Jun Zhu, Yinpeng Dong

TL;DR

This work reveals a robust log-linear scaling law: the attack success rate of black-box adversarial attacks scales linearly with the log of the ensemble size when gradient conflicts are properly resolved. The authors validate the law across standard classifiers, SOTA defenses, and multimodal LLMs, and extend it to vision encoders, demonstrating powerful transfer attacks on models like GPT-4o while exposing a robustness hierarchy in which Claude-3.5-Sonnet proves notably resilient. The study combines theoretical motivation with large-scale empirical analysis, showing that scaling distills robust, semantic features of the target class and arguing that robustness evaluation should shift toward large-scale threat modeling. Practically, this yields a potent benchmarking approach for SOTA models and shows that increasing ensemble size is a principled, powerful lever for evaluating model robustness.

Abstract

Adversarial examples exhibit cross-model transferability, enabling threatening black-box attacks on commercial models. Model ensembling, which attacks multiple surrogate models, is a known strategy to improve this transferability. However, prior studies typically use small, fixed ensembles, which leaves open the question of whether scaling the number of surrogate models can further improve black-box attacks. In this work, we conduct the first large-scale empirical study of this question. By resolving gradient conflict with advanced optimizers, we uncover a robust and universal log-linear scaling law, supported by both theoretical analysis and empirical evaluation: the Attack Success Rate (ASR) scales linearly with the logarithm of the ensemble size $T$. We rigorously verify this law across standard classifiers, SOTA defenses, and MLLMs, and find that scaling distills robust, semantic features of the target class. Consequently, we apply this fundamental insight to benchmark SOTA MLLMs. This reveals both the attack's devastating power and a clear robustness hierarchy: we achieve an 80%+ transfer attack success rate on proprietary models like GPT-4o, while also highlighting the exceptional resilience of Claude-3.5-Sonnet. Our findings urge a shift in focus for robustness evaluation: from designing intricate algorithms on small ensembles to understanding the principled and powerful threat of scaling.
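To make the log-linear claim concrete, the sketch below shows one way such a law could be checked: regress ASR on $\log_2 T$ and read off the slope. It is only an illustration under stated assumptions. The function name `fit_log_linear` is hypothetical, and the ASR values are synthetic placeholders generated from an arbitrary line, not measurements from the paper.

```python
import numpy as np


def fit_log_linear(ensemble_sizes, asr):
    """Least-squares fit of asr ~ slope * log2(T) + intercept.

    Hypothetical helper for illustration; not taken from the paper.
    """
    x = np.log2(np.asarray(ensemble_sizes, dtype=float))
    y = np.asarray(asr, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


# Ensemble sizes on a base-2 grid: 1, 2, 4, ..., 64.
sizes = 2 ** np.arange(7)

# Synthetic placeholder values (NOT results from the paper): an exact
# log-linear curve with an arbitrary slope of 0.12 and intercept of 0.10.
synthetic_asr = 0.10 + 0.12 * np.log2(sizes)

slope, intercept = fit_log_linear(sizes, synthetic_asr)
print(f"ASR gain per doubling of T: {slope:.3f}, ASR at T=1: {intercept:.3f}")
```

Under a log-linear law, the slope is the ASR gained each time the ensemble size doubles, so equal gains require exponentially more surrogate models.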

Paper Structure

This paper contains 39 sections, 2 theorems, 19 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that model ensemble $\{f_i\}_{i=1}^T$ is i.i.d. sampled from some distribution on $\mathcal{F}$, and $\hat{\bm{x}}$ goes to $\bm{x}^*$ as $T \to \infty$. Denote $\overset{d}{\to}$ as convergence in distribution, and $\mathrm{Cov}$ as the covariance matrix. Then we have the following asymptot

Figures (8)

  • Figure 1: Key observation of scaling laws for the transfer-based black-box attack. By scaling the number of surrogate models in the ensemble, we observe that the attack success rate and the discriminative loss on the target model follow a log-linear scaling law.
  • Figure 2: Comparisons between CWA and a naïve ensemble (MI-FGSM with logit averaging). As $T$ increases, the naïve method suffers optimization stagnation (gradient norm collapses) due to gradient conflict. Advanced methods like CWA, which align gradients, avoid this issue entirely.
  • Figure 3: Scaling laws over model ensembles in a black-box setting. The x-axis uses a base-2 logarithmic scale, and we measure both the attack success rate (ASR) and the average cross-entropy loss of target models. The cardinality of the model ensembles is varied from $2^0$ to $2^6$, with models randomly selected for each ensemble. We fit lines to the results of selected target models for clarity.
  • Figure 4: Verification of the log-linear scaling laws' universality. (a) The trend persists against SOTA defenses. (b) The trend persists for MLLMs. In both settings, we record the ASR averaged across all 8 target classes to show overall model robustness.
  • Figure 5: Verification of the log-linear scaling laws on SOTA MLLMs. We use an ensemble of CLIP models as surrogates, scaling the cardinality $T$ from 1 to 12. All attacks in this figure are generated using a practical budget ($\epsilon = 16/255$). The ASR is measured using our single, strict LLM-as-Judge metric (defined in Section 5.1). The log-linear scaling laws clearly persist.
  • ...and 3 more figures
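
The gradient-norm collapse described for the naïve ensemble in Figure 2 can be illustrated with a toy model. The sketch below is not the paper's experiment and does not implement CWA or MI-FGSM. It assumes each surrogate contributes a unit-norm gradient and compares two extremes: fully conflicting gradients (independent random directions) and perfectly aligned gradients (one shared direction). The dimension, seed, and function names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4096  # hypothetical input dimension for the toy example


def random_unit_gradients(t):
    """Return t independent random unit vectors (a stand-in for conflicting gradients)."""
    g = rng.standard_normal((t, DIM))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def aligned_unit_gradients(t):
    """Return t copies of one unit vector (a stand-in for perfectly aligned gradients)."""
    g = random_unit_gradients(1)
    return np.repeat(g, t, axis=0)


def averaged_gradient_norm(grads):
    """Norm of the plain average of the per-model gradients."""
    return float(np.linalg.norm(grads.mean(axis=0)))


ensemble_sizes = (1, 4, 16, 64)
conflicting = {t: averaged_gradient_norm(random_unit_gradients(t)) for t in ensemble_sizes}
aligned = {t: averaged_gradient_norm(aligned_unit_gradients(t)) for t in ensemble_sizes}

for t in ensemble_sizes:
    print(f"T={t:2d}  conflicting={conflicting[t]:.3f}  aligned={aligned[t]:.3f}")
```

In this toy setting the averaged norm of the conflicting gradients shrinks roughly like $1/\sqrt{T}$, while the aligned gradients keep norm 1 for every $T$. Real surrogate gradients fall somewhere between these two extremes.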

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Remark 1
  • Theorem 2
  • Remark 2