Rethinking Model Ensemble in Transfer-based Adversarial Attacks

Huanran Chen, Yichi Zhang, Yinpeng Dong, Xiao Yang, Hang Su, Jun Zhu

TL;DR

The paper tackles the challenge of transfer-based adversarial attacks by reframing ensemble effects as exploiting a common weakness: a flat loss landscape and proximity to local optima across surrogate models. It introduces Common Weakness Attack (CWA), combining Sharpness Aware Minimization (SAM) and Cosine Similarity Encourager (CSE) to simultaneously flatten the loss surface and align surrogate gradients, enabling robust transfer to unseen models. The approach is validated across image classification, object detection, and a black-box large vision-language model (Bard), showing significant improvements over prior methods and demonstrating practical implications for real-world systems. The work provides a transferable, plug-in toolkit (MI-CWA, VMI-CWA, SSA-CWA) and rich analyses of loss landscapes and gradient properties, underscoring both the potential and the defense challenges in modern AI deployments.

Abstract

It is widely recognized that deep learning models lack robustness to adversarial examples. An intriguing property of adversarial examples is that they can transfer across different models, which enables black-box attacks without any knowledge of the victim model. An effective strategy to improve the transferability is attacking an ensemble of models. However, previous works simply average the outputs of different models, lacking an in-depth analysis on how and why model ensemble methods can strongly improve the transferability. In this paper, we rethink the ensemble in adversarial attacks and define the common weakness of model ensemble with two properties: 1) the flatness of loss landscape; and 2) the closeness to the local optimum of each model. We empirically and theoretically show that both properties are strongly correlated with the transferability and propose a Common Weakness Attack (CWA) to generate more transferable adversarial examples by promoting these two properties. Experimental results on both image classification and object detection tasks validate the effectiveness of our approach to improving the adversarial transferability, especially when attacking adversarially trained models. We also successfully apply our method to attack a black-box large vision-language model, Google's Bard, showing the practical effectiveness. Code is available at https://github.com/huanranchen/AdversarialAttacks.
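The two properties named in the abstract, together with the gradient alignment mentioned in the TL;DR, can be made concrete on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes each surrogate model has a quadratic loss with a local optimum `p` and a diagonal Hessian `h_diag`, and all names (`quadratic_grad`, `ensemble_stats`, `SURROGATES`) are hypothetical. It reports the mean Hessian Frobenius norm (smaller means flatter), the mean distance from a candidate point to the surrogates' local optima, and the mean pairwise cosine similarity of the surrogate gradients.

```python
import math

def quadratic_grad(x, p, h_diag):
    """Gradient of the toy loss 0.5 * sum_k h_k * (x_k - p_k)^2."""
    return [h * (xk - pk) for xk, pk, h in zip(x, p, h_diag)]

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def cosine(u, v):
    """Cosine similarity; assumes both vectors are nonzero."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def mean_pairwise_cosine(grads):
    """Average cosine similarity over all unordered pairs of gradients."""
    sims = [cosine(grads[i], grads[j])
            for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return sum(sims) / len(sims)

def ensemble_stats(x, surrogates):
    """Summarize the toy ensemble at point x.

    surrogates: list of (p, h_diag) pairs, one per hypothetical surrogate.
    """
    grads = [quadratic_grad(x, p, h) for p, h in surrogates]
    n = len(surrogates)
    return {
        # Frobenius norm of a diagonal Hessian is the norm of its diagonal.
        "hessian_norm": sum(norm(h) for _, h in surrogates) / n,
        "distance": sum(norm([xk - pk for xk, pk in zip(x, p)])
                        for p, _ in surrogates) / n,
        "grad_cosine": mean_pairwise_cosine(grads),
    }

# Two hypothetical surrogates with different local optima, same curvature.
SURROGATES = [([1.0, 0.0], [1.0, 1.0]), ([0.0, 1.0], [1.0, 1.0])]

far = ensemble_stats([-5.0, -5.0], SURROGATES)
between = ensemble_stats([0.5, 0.5], SURROGATES)
print(far)
print(between)
```

In this toy setup the surrogate gradients nearly coincide far from both optima and point in opposite directions midway between them, which illustrates why closeness to each model's optimum and gradient agreement are measured separately.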

Paper Structure

This paper contains 35 sections, 5 theorems, 36 equations, 7 figures, 10 tables, and 4 algorithms.

Key Result

Theorem 3.1

(Proof in sec:upperboundproof) Assuming that the covariance between $\|\bm{H}_i\|_F$ and $\|\bm{p}_i-\bm{x}\|_2$ is zero, we can get the upper bound of the second term as
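The bound itself is not reproduced in this summary. As a hedged illustration only, the sketch below assumes that the "second term" is the quadratic form $(\bm{x}-\bm{p}_i)^\top \bm{H}_i (\bm{x}-\bm{p}_i)$ from a second-order expansion around the local optimum $\bm{p}_i$; this is an assumption, not something the text above states. Under that assumption, the standard pointwise inequality $\bm{v}^\top \bm{H} \bm{v} \le \|\bm{H}\|_F \|\bm{v}\|_2^2$ ties the term to the two quantities in the theorem, and the code checks it numerically on random symmetric matrices. All function names are hypothetical.

```python
import math
import random

def quad_form(h, v):
    """Compute v^T H v for a square matrix h (list of rows) and vector v."""
    n = len(v)
    return sum(v[i] * h[i][j] * v[j] for i in range(n) for j in range(n))

def frobenius_norm(h):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(e * e for row in h for e in row))

def bound_gap(h, x, p):
    """Return ||H||_F * ||x - p||_2^2 - (x - p)^T H (x - p).

    A non-negative value means the assumed bound holds at this sample.
    """
    v = [xi - pi for xi, pi in zip(x, p)]
    return frobenius_norm(h) * sum(c * c for c in v) - quad_form(h, v)

def random_symmetric(n, rng):
    """Random symmetric matrix, standing in for a Hessian (assumption)."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]

rng = random.Random(0)  # fixed seed so the check is reproducible
gaps = []
for _ in range(1000):
    h = random_symmetric(4, rng)
    x = [rng.gauss(0.0, 1.0) for _ in range(4)]
    p = [rng.gauss(0.0, 1.0) for _ in range(4)]
    gaps.append(bound_gap(h, x, p))

min_gap = min(gaps)
print("bound held on all samples:", min_gap >= -1e-9)
```

The check covers only the pointwise inequality; the theorem's statement in expectation additionally relies on the zero-covariance assumption quoted above.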

Figures (7)

  • Figure 1: Illustration of Common Weakness. The generalization error is strongly correlated with the flatness of the loss landscape and the distance between the solution and the closest local optimum of each model. We define the common weakness of a model ensemble as a solution that lies in a flat region of the loss landscape and is close to the local optima of the training models, as shown in (d).
  • Figure 2: Illustration of MI, SAM, and MI-SAM. The symbols are introduced in Eqs. (4)-(6).
  • Figure 3: Examples of transfer-based black-box attacks against Google's Bard.
  • Figure 4: Additional results. (a-b): The loss landscape around the convergence point optimized by MI and MI-CWA, respectively. (c): Attack success rate under different numbers of attack iterations $T$.
  • Figure C.1: Visualization of adversarial patches from different methods. The patch trained with a simple loss ensemble looks like a fusion of the patches trained on YOLOv3 and YOLOv5. Adam-CWA captures the common weakness of YOLOv3 and YOLOv5, and therefore generates a completely different patch.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Theorem 3.1
  • Theorem 3.2
  • proof
  • Theorem A.2
  • proof
  • Lemma A.3
  • Theorem B.1
  • proof