Improving Adversarial Transferability via Model Alignment
Avery Ma, Amir-massoud Farahmand, Yangchen Pan, Philip Torr, Jindong Gu
TL;DR
This work tackles adversarial transferability by proposing model alignment, in which a source model is fine-tuned to reduce its divergence from an independently trained witness model using an alignment loss $\ell_a(x,\theta_s,\theta_w)=d\big(z_s^{[q]}(x),z_w^{[q]}(x)\big)$ (often with KL divergence at $q=l$). It combines semantic-feature analysis with a geometric view of the loss landscape, showing that alignment encourages the source model to exploit shared semantic features and yields a flatter loss landscape, which correlates with higher transferability. Empirical evaluation on ImageNet across CNNs and ViTs demonstrates that $\ell_\infty$-norm bounded perturbations generated from the aligned source model transfer substantially better than those from the original model, and the method remains compatible with a broad spectrum of attack techniques, including embedding-space variants. Although the method is effective, the study notes that model alignment lacks a formal theory and suggests exploring witness-model diversity, embedding-space strategies, and theoretical grounding in future work.
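As a rough illustration of the KL instance mentioned above, the fine-tuning objective could be written as follows. This is a sketch under assumptions not stated in the summary: $\mathcal{D}$ denotes the data used for alignment, $\sigma$ denotes the softmax, the witness parameters $\theta_w$ are held fixed, and the direction of the KL divergence (witness first) is chosen for illustration only.

$$\min_{\theta_s}\ \mathbb{E}_{x\sim\mathcal{D}}\big[\ell_a(x,\theta_s,\theta_w)\big],\qquad \ell_a(x,\theta_s,\theta_w)=\mathrm{KL}\Big(\sigma\big(z_w^{[l]}(x)\big)\,\Big\|\,\sigma\big(z_s^{[l]}(x)\big)\Big)$$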
Abstract
Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability to generate transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
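The alignment loss described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the divergence is a KL divergence between softmax outputs, with the witness distribution as the first argument, and the function names `softmax`, `kl_divergence`, and `alignment_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def alignment_loss(source_logits, witness_logits):
    """Divergence between witness and source predictions for one input.

    Assumption: KL(witness || source) on softmax outputs; the direction
    and the choice of output layer are illustrative.
    """
    p_witness = softmax(witness_logits)
    p_source = softmax(source_logits)
    return kl_divergence(p_witness, p_source)


# Identical predictions give zero loss; disagreeing ones give a positive loss.
same = alignment_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = alignment_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In an actual fine-tuning loop, this quantity would be averaged over a batch and minimized with respect to the source model's parameters only, typically using an automatic-differentiation framework.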
