Improving Adversarial Transferability via Model Alignment

Avery Ma; Amir-massoud Farahmand; Yangchen Pan; Philip Torr; Jindong Gu

TL;DR

This work tackles adversarial transferability by proposing model alignment, where a source model is fine-tuned to reduce divergence from an independently trained witness model using an alignment loss $\ell_a(x,\theta_s,\theta_w)=d\big(z_s^{[q]}(x),z_w^{[q]}(x)\big)$ (often with KL divergence at $q=l$). It combines semantic-feature analysis with a geometric view of the loss landscape, showing that alignment encourages exploitation of shared semantic features and yields a flatter landscape that correlates with higher transferability. Empirical evaluation on ImageNet across CNNs and ViTs demonstrates substantially improved transfer of $\ell_\infty$-norm bounded perturbations generated from the aligned source model compared to the original, and the method remains compatible with a broad spectrum of attack techniques, including embedding-space variants. While effective, the study notes a lack of formal theory for model alignment and suggests exploring witness-model diversity, embedding-space strategies, and theoretical grounding in future work.

Abstract

Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability to generate transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence between the predictions of the source model and those of another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
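The alignment step described above can be illustrated with a toy sketch. It assumes a linear source model, a single input, a fixed witness prediction, and the KL direction KL(witness || source); none of these choices are confirmed by the text here, and the names `align_step`, `W_s`, and `W_w` are hypothetical. It is not the authors' implementation, only an illustration that fine-tuning the source parameters on an output-space KL loss pulls its predictions toward the witness.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def linear_logits(W, x):
    """Logits of a toy linear model: one row of W per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def align_step(W_s, x, p_w, lr=0.5):
    """One gradient step on KL(p_w || softmax(W_s x)) with respect to W_s.

    For softmax outputs, the gradient of this loss with respect to the
    source logits is (p_s - p_w); the chain rule through the linear
    layer multiplies each component by the input x.
    """
    p_s = softmax(linear_logits(W_s, x))
    return [[w - lr * (ps - pw) * xi for w, xi in zip(row, x)]
            for row, ps, pw in zip(W_s, p_s, p_w)]

# Toy data (assumed values, purely illustrative).
x = [1.0, -0.5]
W_s = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]   # source parameters (fine-tuned)
W_w = [[0.5, 0.2], [-0.3, 0.1], [0.1, -0.2]]   # witness parameters (kept fixed)
p_w = softmax(linear_logits(W_w, x))

losses = []
for _ in range(50):
    losses.append(kl_divergence(p_w, softmax(linear_logits(W_s, x))))
    W_s = align_step(W_s, x, p_w)
losses.append(kl_divergence(p_w, softmax(linear_logits(W_s, x))))

print("alignment loss before:", losses[0])
print("alignment loss after: ", losses[-1])
```

Only the source parameters are updated; the witness stays frozen, matching the description that the source model is fine-tuned against an independently trained witness. In practice the loss would be averaged over a dataset and minimized with a deep-learning framework rather than hand-written gradients.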

Paper Structure (30 sections, 3 equations, 3 figures, 14 tables)


Introduction
Related Work
Generating Transferable Perturbations
Understanding Adversarial Transferability
Method
Preliminary
Model Alignment
Understanding Model Alignment
Aligned Model Exploits More Semantic Features.
Model Alignment Yields Smoother Loss Surface.
Experiments
Experiment Setup
Model Alignment Improves Transferability
Ablation Studies
Smaller Witness Model Might Boost Learning Shared Features.
...and 15 more sections

Figures (3)

Figure 1: Attacking the aligned source model for more transferable perturbations. We compare the transferability of $\ell_\infty$-norm bounded perturbations ($\epsilon=4/255$) generated using the source model before and after performing model alignment. The result highlights the compatibility of model alignment with a wide range of attacks, as perturbations generated from the aligned source model become more transferable. Here, the source model is aligned using a witness model of the same architecture that is initialized and trained independently. Results are averaged over all target models.
Figure 2: A frequency-domain visualization of the differences in the perturbation generated using the original source model and the aligned source model. We compare the magnitude of the DCT coefficients between the perturbations generated by the two models: $\left\vert\text{DCT}(\Delta x_a)\right\vert - \left\vert\text{DCT}(\Delta x_s)\right\vert$. The pronounced brightness in the top-left region of the spectrum indicates that the primary differences lie within the low-frequency range, which is typically associated with semantic features.
Figure 3: Visualization of the loss surface around adversarial perturbations for original and aligned ResNet50 and ViT-b/16. Each plot illustrates the loss surface projected on the plane defined by the adversarial perturbation direction and its orthogonal vector. We examine the loss landscape around a clean data point (cyan) and an $\ell_\infty$-bounded adversarially perturbed data point (red), generated from the source ($\Delta x_s$) and aligned models ($\Delta x_a$). Perturbations from the original source models are at sharper loss maxima, while those from the aligned model are on flatter surfaces.
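
The frequency-domain comparison in the Figure 2 caption, $\left\vert\text{DCT}(\Delta x_a)\right\vert - \left\vert\text{DCT}(\Delta x_s)\right\vert$, can be sketched as follows. This is a minimal illustration, assuming an orthonormal 2-D DCT-II and tiny synthetic $4\times4$ perturbations; the DCT normalization used for the figure is not stated, and the helper names and toy inputs are hypothetical.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, v in enumerate(x)))
    return out

def dct_2d(img):
    """Separable 2-D DCT-II: transform rows, then columns."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # transform each column
    return [list(r) for r in zip(*cols)]           # transpose back to row-major

def dct_magnitude_difference(delta_a, delta_s):
    """Elementwise |DCT(delta_a)| - |DCT(delta_s)|."""
    da, ds = dct_2d(delta_a), dct_2d(delta_s)
    return [[abs(a) - abs(s) for a, s in zip(ra, rs)] for ra, rs in zip(da, ds)]

# Synthetic stand-ins (assumed, not real perturbations):
# a smooth, low-frequency perturbation versus a checkerboard, high-frequency one,
# both with l_inf norm equal to epsilon = 4/255.
eps = 4 / 255
delta_a = [[eps for _ in range(4)] for _ in range(4)]
delta_s = [[eps * (-1) ** (i + j) for j in range(4)] for i in range(4)]

diff = dct_magnitude_difference(delta_a, delta_s)
print("low-frequency (top-left) entry:     ", diff[0][0])
print("high-frequency (bottom-right) entry:", diff[3][3])
```

In this toy case the top-left entry is positive and the bottom-right entry is negative, which is the pattern the caption describes as brightness concentrated in the low-frequency corner of the spectrum.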
