
Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers

Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, Shu-tao Xia

TL;DR

This work uncovers a switchable backdoor threat for pre-trained vision transformers adapted with visual prompts. The proposed method, SWARM, learns a trigger, clean prompt tokens, and a switch token, so that the backdoor can be activated only when desired while normal performance is preserved in clean mode. Using three losses (clean, backdoor, and cross-mode feature distillation), the approach achieves high attack success rates (often above 95%) and strong stealth against state-of-the-art detection and mitigation methods across multiple ViT-based backbones and VTAB-1k tasks. The findings highlight a practical security risk in cloud-based, prompt-driven model deployment and motivate the development of defenses tailored to switchable backdoors in the pre-training-then-prompting paradigm.

Abstract

Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat to this paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., convert a benign model into a backdoored one. Once in backdoor mode, a specific trigger can force the model to predict a target class. This poses a severe risk to users of cloud APIs, since the malicious behavior cannot be activated or detected in benign mode, which makes the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens, including a switch token. They are optimized with a clean loss, which encourages the model to always behave normally even when the trigger is present, and a backdoor loss, which ensures that the backdoor can be activated by the trigger when the switch is on. In addition, we use cross-mode feature distillation to reduce the effect of the switch token on clean samples. Experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, which achieves a 95%+ attack success rate and is hard to detect and remove. Our code is available at https://github.com/20000yshust/SWARM.
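
The three training signals described above can be written down as a rough sketch. This is a reading of the abstract and the Figure 2 notation ($P$ for the clean tokens, $S$ for the switch token, $\delta$ for the trigger), not the paper's own equations. The classifier output $f$, the feature extractor $\phi$, the target class $t$, the additive trigger $x+\delta$, the squared $\ell_2$ distance, and the use of $\lambda$ as the weight on the distillation term are all assumptions made for illustration.

$$
\begin{aligned}
\mathcal{L}_{\text{clean}} &= \mathbb{E}_{(x,y)}\big[\mathrm{CE}(f(x; P),\, y) + \mathrm{CE}(f(x+\delta; P),\, y)\big] \\
\mathcal{L}_{\text{bd}} &= \mathbb{E}_{(x,y)}\big[\mathrm{CE}(f(x+\delta; P, S),\, t)\big] \\
\mathcal{L}_{\text{cs}} &= \mathbb{E}_{x}\big[\lVert \phi(x; P, S) - \phi(x; P) \rVert_2^2\big] \\
\mathcal{L} &= \mathcal{L}_{\text{clean}} + \mathcal{L}_{\text{bd}} + \lambda\, \mathcal{L}_{\text{cs}}
\end{aligned}
$$

Under this reading, the clean loss keeps predictions correct without the switch token even for triggered inputs, the backdoor loss drives triggered inputs to $t$ once $S$ is present, and the distillation term keeps clean-sample features close across the two modes.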

Paper Structure

This paper contains 36 sections, 6 equations, 13 figures, 12 tables, and 1 algorithm.

Figures (13)

  • Figure 1: The inference process in SWARM. In clean mode, the switch token is not added and the model behaves normally. Clean and triggered images are both predicted correctly, so users cannot detect any anomaly. In backdoor mode, the switch token is added and the model behaves as a backdoored one. Triggered images are maliciously predicted as the target label, while clean images are still predicted correctly.
  • Figure 2: The three losses used in SWARM. $P$ denotes the clean tokens, $S$ the switch token, $X$ the images, and $\delta$ the trigger. The clean loss updates the clean tokens and the trigger. The backdoor loss updates the switch token and the trigger. The cross-mode feature distillation loss updates only the switch token.
  • Figure 3: Visualization of clean and backdoor images.
  • Figure 4: The effect of increasing the number of switch tokens.
  • Figure 5: The effect of increasing $\lambda$.
  • ...and 8 more figures
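
The two inference modes in the Figure 1 caption can be illustrated with a small, dependency-free sketch. It only shows how the prompt differs between modes and how a trigger might be applied. The function names `build_prompt` and `apply_trigger` are hypothetical, and appending the switch token after the clean tokens and using an additive trigger clipped to [0, 1] are assumptions of this sketch rather than details confirmed by the source.

```python
from typing import List, Sequence

Token = List[float]  # one prompt token embedding (toy dimensionality)


def build_prompt(clean_tokens: Sequence[Token], switch_token: Token,
                 backdoor_mode: bool) -> List[Token]:
    """Return the prompt tokens that accompany the image patches.

    Clean mode uses only the clean tokens P. Backdoor mode additionally
    includes the switch token S (placed last here, which is an assumption).
    """
    prompt = [list(tok) for tok in clean_tokens]
    if backdoor_mode:
        prompt.append(list(switch_token))
    return prompt


def apply_trigger(image: Sequence[float], delta: Sequence[float]) -> List[float]:
    """Add the trigger delta to a flattened image and clip to [0, 1].

    An additive, clipped trigger is an assumption of this sketch.
    """
    if len(image) != len(delta):
        raise ValueError("image and trigger must have the same length")
    return [min(1.0, max(0.0, x + d)) for x, d in zip(image, delta)]


if __name__ == "__main__":
    P = [[0.1, 0.2], [0.3, 0.4]]   # two toy clean tokens
    S = [0.9, -0.9]                # one toy switch token
    print(len(build_prompt(P, S, backdoor_mode=False)))  # 2
    print(len(build_prompt(P, S, backdoor_mode=True)))   # 3
```

In this picture the service provider controls `backdoor_mode`, so a user who queries the model in clean mode never sees a prompt containing the switch token, even when the input carries the trigger.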