Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers
Weijie Zheng, Xingjun Ma, Hanxun Huang, Zuxuan Wu, Yu-Gang Jiang
TL;DR
This work investigates how vulnerabilities transfer from large pre-trained Vision Transformers (ViTs) to downstream models, using the Downstream Transfer Attack (DTA). It introduces Average Token Cosine Similarity (ATCS) as both the attack objective and a transferability indicator, and develops a layer-aware strategy to identify the most transferable vulnerabilities across ViT layers. Empirical results across multiple pre-training methods, fine-tuning schemes, model scales, and downstream tasks show that DTA achieves high attack success rates (often >90%) and that parameter-efficient transfer learning (PETL) approaches such as LoRA and AdaptFormer can exacerbate downstream fragility. The study also demonstrates that DTA-informed adversarial training can improve robustness against downstream transfer attacks, which has practical implications for defending real-world deployments.
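To make the ATCS idea concrete, here is a minimal NumPy sketch. It assumes ATCS is the mean, over tokens, of the cosine similarity between the token features of a clean image and those of its adversarial counterpart, taken at one ViT layer; this summary does not spell out the exact definition, so the function name `atcs` and the `(num_tokens, dim)` layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def atcs(feats_a, feats_b, eps=1e-12):
    """Average token cosine similarity between two (num_tokens, dim) arrays.

    Assumption: row i of each array is the feature of the same token
    position, e.g. clean vs. adversarial features at one layer.
    """
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    dots = np.sum(a * b, axis=1)                       # per-token dot product
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(dots / np.maximum(norms, eps)))

# Hypothetical token features: 4 tokens, 8 dimensions each.
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8))
print(atcs(clean, clean))    # identical features give a value of about 1
print(atcs(clean, -clean))   # fully reversed features give a value of about -1
```

Under this reading, a lower ATCS means the perturbation has pushed the token features further from their clean directions, which is why the same quantity could plausibly serve both as a loss to minimize and as an indicator of how disruptive an attack is.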
Abstract
With the advancement of vision transformers (ViTs) and self-supervised learning (SSL) techniques, pre-trained large ViTs have become the new foundation models for computer vision applications. However, studies have shown that, like convolutional neural networks (CNNs), ViTs are also susceptible to adversarial attacks, where subtle perturbations in the input can fool the model into making incorrect predictions. This paper studies how this adversarial vulnerability transfers from a pre-trained ViT model to downstream tasks. We focus on sample-wise transfer attacks and propose a novel attack method termed Downstream Transfer Attack (DTA). For a given test image, DTA uses a pre-trained ViT model to craft an adversarial example and then applies that example to attack a version of the model fine-tuned on a downstream dataset. During the attack, DTA identifies and exploits the most vulnerable layers of the pre-trained model, guided by a cosine similarity loss, to craft highly transferable attacks. Through extensive experiments with ViTs pre-trained by 3 distinct pre-training methods, under 3 fine-tuning schemes, and across 10 diverse downstream datasets, we show that DTA achieves an average attack success rate (ASR) exceeding 90%, surpassing existing methods by a large margin. When used for adversarial training, the adversarial examples generated by DTA can significantly improve the model's robustness to different downstream transfer attacks.
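The attack described above can be illustrated with a toy sketch. The code below is not the authors' method: it replaces the pre-trained ViT with a hypothetical linear patch encoder (`weight`) so that the gradient can be written by hand, it omits the layer-selection step entirely, and the names `toy_dta`, `epsilon`, `step`, and `iters` are illustrative assumptions. It shows only the general shape of a signed-gradient loop that lowers the average token cosine similarity between clean and perturbed features under an L-infinity budget.

```python
import numpy as np

def atcs_and_grad(adv_tokens, clean_tokens, eps=1e-12):
    """Return ATCS and its gradient with respect to adv_tokens, both (N, D)."""
    na = np.maximum(np.linalg.norm(adv_tokens, axis=1, keepdims=True), eps)
    nb = np.maximum(np.linalg.norm(clean_tokens, axis=1, keepdims=True), eps)
    cos = np.sum(adv_tokens * clean_tokens, axis=1, keepdims=True) / (na * nb)
    # d cos / d a = b / (|a||b|) - cos * a / |a|^2, averaged over N tokens
    grad = (clean_tokens / (na * nb) - cos * adv_tokens / na**2) / adv_tokens.shape[0]
    return float(cos.mean()), grad

def toy_dta(patches, weight, epsilon=8 / 255, step=2 / 255, iters=10, seed=0):
    """Signed-gradient descent on ATCS for a toy linear encoder (an assumption).

    patches: (N, P) pixel patches in [0, 1]; weight: (P, D) stand-in encoder.
    """
    rng = np.random.default_rng(seed)
    clean_tokens = patches @ weight
    # Random start: at delta = 0 the cosine is at its maximum, so the gradient is zero.
    delta = rng.uniform(-epsilon, epsilon, size=patches.shape)
    for _ in range(iters):
        adv = np.clip(patches + delta, 0.0, 1.0)
        _, grad_tokens = atcs_and_grad(adv @ weight, clean_tokens)
        grad_patches = grad_tokens @ weight.T          # chain rule through the linear map
        # Step against the gradient to reduce similarity, then project onto the budget.
        delta = np.clip(delta - step * np.sign(grad_patches), -epsilon, epsilon)
    return np.clip(patches + delta, 0.0, 1.0)

rng = np.random.default_rng(1)
patches = rng.uniform(0.0, 1.0, size=(6, 16))   # hypothetical image patches
weight = rng.normal(size=(16, 8))               # hypothetical encoder weights
adv = toy_dta(patches, weight)
score, _ = atcs_and_grad(adv @ weight, patches @ weight)
print(score)  # ATCS between adversarial and clean tokens
```

In an actual transfer setting, the perturbed image crafted this way on the pre-trained model would then be fed to the fine-tuned downstream model, which the attacker never queries for gradients.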
