Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping

Qinliang Lin, Cheng Luo, Zenghao Niu, Xilin He, Weicheng Xie, Yuanbo Hou, Linlin Shen, Siyang Song

TL;DR

This work tackles the challenge of adversarial transferability across model genera (e.g., CNNs to Vision Transformers). It introduces the Deformation-Constrained Warping Attack (DeCoWA), which embeds a deformation-based input transformation (DeCoW) with adaptive constraints into a gradient-based attack to diversify local geometry while preserving global semantics. The method yields substantial transfer gains across image, video, and audio tasks and outperforms state-of-the-art input-augmentation baselines; Grad-CAM analyses support it by showing that CNNs adopt more global attention under DeCoW. These results establish a strong, modality-spanning baseline for cross-genus adversarial transferability and highlight avenues for defense and further cross-domain research.

Abstract

Adversarial examples generated by a surrogate model typically exhibit limited transferability to unknown target systems. To address this problem, many transferability enhancement approaches (e.g., input transformation and model augmentation) have been proposed. However, they perform poorly when attacking systems whose model genus differs from that of the surrogate model. In this paper, we propose a novel and generic attack strategy, called Deformation-Constrained Warping Attack (DeCoWA), that can be effectively applied to cross-model-genus attacks. Specifically, DeCoWA first augments input examples via an elastic deformation, namely Deformation-Constrained Warping (DeCoW), to obtain rich local details of the augmented input. To avoid the severe distortion of global semantics caused by random deformation, DeCoW further constrains the strength and direction of the warping transformation with a novel adaptive control strategy. Extensive experiments demonstrate that the transferable examples crafted by our DeCoWA on CNN surrogates can significantly hinder the performance of Transformers (and vice versa) on various tasks, including image classification, video action recognition, and audio recognition. Code is made available at https://github.com/LinQinLiang/DeCoWA.
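To make the idea of a strength-constrained elastic deformation concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes a coarse grid of control-point offsets that is clipped to a fixed bound and bilinearly upsampled into a dense displacement field. The function names, the 4x4 grid, and the `max_shift` parameter are hypothetical choices for illustration, and the fixed clipping bound only stands in for the adaptive control strategy described above, which is not reproduced here.

```python
import numpy as np


def upsample_bilinear(coarse, height, width):
    """Bilinearly resize a (gh, gw) grid of offsets to (height, width)."""
    gh, gw = coarse.shape
    ys = np.linspace(0, gh - 1, height)
    xs = np.linspace(0, gw - 1, width)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, gh - 1)
    x1 = np.minimum(x0 + 1, gw - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = coarse[y0][:, x0] * (1 - wx) + coarse[y0][:, x1] * wx
    bottom = coarse[y1][:, x0] * (1 - wx) + coarse[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def sample_bilinear(image, ys, xs):
    """Sample a 2-D image at float coordinates, clamping to the border."""
    h, w = image.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0
    wx = xs - x0
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def constrained_warp(image, offsets_y, offsets_x, max_shift):
    """Warp `image` with control-point offsets clipped to +/- max_shift pixels.

    Assumption: clipping is a simple stand-in for a deformation constraint;
    it bounds the strength of the warp but does not adapt its direction.
    """
    h, w = image.shape
    dy = upsample_bilinear(np.clip(offsets_y, -max_shift, max_shift), h, w)
    dx = upsample_bilinear(np.clip(offsets_x, -max_shift, max_shift), h, w)
    grid_y, grid_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return sample_bilinear(image, grid_y + dy, grid_x + dx)


# Usage: warp a random grayscale "image" with a hypothetical 4x4 control grid.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
offsets_y = rng.normal(scale=2.0, size=(4, 4))
offsets_x = rng.normal(scale=2.0, size=(4, 4))
warped = constrained_warp(image, offsets_y, offsets_x, max_shift=1.5)
```

In a transfer attack, a transformation of this kind would typically be applied to the current adversarial example before each gradient computation on the surrogate model, so that the gradient is averaged over locally deformed views. Keeping `max_shift` small limits each pixel's displacement, which is one simple way to perturb local shapes while leaving the global layout largely intact.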

Paper Structure

This paper contains 18 sections, 16 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of distinctions between CNNs and ViTs. We test the recognition accuracy of two different model genera using (a) images with blurred local details and (b) images with damaged global structure. (a) As more local details are blurred, the performance of ResNet-50 drops significantly while DeiT-B remains robust. (b) When only a smaller image patch remains, ResNet-50 achieves higher classification accuracy than DeiT-B. # denotes a count.
  • Figure 2: The process of updating $\xi$ and $x_{t}^{adv}$. The left part shows a diagram of the update process. The right column enumerates the input sample and its result after VWT and DeCoW, respectively.
  • Figure 3: Visualization of Grad-CAM (Selvaraju et al., ICCV 2017) for two trained models, ResNet-50 and ViT-B/16. (a)$\sim$(b): the results for raw images on ResNet-50 and ViT-B/16. (c)$\sim$(e): the results for SI (Lin et al., ICLR 2020), Admix (Wang et al., ICCV 2021), and S$^{2}$I (Long et al., ECCV 2022) images on ResNet-50. (f): the result for our DeCoW images on ResNet-50.
  • Figure 4: In comparison with other input transformation methods, our method makes profound changes to local shapes and contours (red box), thus accessing diverse localities, while the others can only increase global diversity.