Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models

Shuai Zhao; Jinming Wen; Luu Anh Tuan; Junbo Zhao; Jie Fu

TL;DR

This paper investigates backdoor vulnerabilities in prompt-based NLP and introduces ProAttack, a clean-label backdoor attack that uses the prompt itself as the trigger, so it needs no external trigger and leaves the labels of poisoned samples correct. By engineering prompts as triggers and carefully selecting which samples are poisoned and which stay clean, ProAttack achieves near-100% attack success on rich-resource and few-shot text classification tasks, performs strongly relative to existing methods, and evades some defenses. Experiments across multiple datasets and model families show that prompt-based clean-label backdoors are practical and stealthy, underscoring the need for robust detection and defense mechanisms in prompt-driven NLP systems.

Abstract

The prompt-based learning paradigm, which bridges the gap between pre-training and fine-tuning, achieves state-of-the-art performance on several NLP tasks, particularly in few-shot settings. Despite being widely applied, prompt-based learning is vulnerable to backdoor attacks. Textual backdoor attacks are designed to introduce targeted vulnerabilities into models by poisoning a subset of training samples through trigger injection and label modification. However, they suffer from flaws such as abnormal natural language expressions resulting from the trigger and incorrect labeling of poisoned samples. In this study, we propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt, which uses the prompt itself as a trigger. Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack. With extensive experiments on rich-resource and few-shot text classification tasks, we empirically validate ProAttack's competitive performance in textual backdoor attacks. Notably, in the rich-resource setting, ProAttack achieves state-of-the-art attack success rates in the clean-label backdoor attack benchmark without external triggers.
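The poisoning step described in the abstract can be illustrated with a minimal sketch. It assumes a binary text classification dataset given as `(text, label)` pairs; the two prompt templates, the function name `build_poisoned_set`, and its parameters are hypothetical and are not taken from the paper. A subset of samples that already belong to the target class receives the trigger prompt, all other samples receive the clean prompt, and no label is changed.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label)

# Hypothetical prompt templates, chosen only for illustration.
CLEAN_PROMPT = "This sentence has a <mask> sentiment: "
TRIGGER_PROMPT = "The sentiment of this sentence is <mask>: "


def build_poisoned_set(
    data: List[Sample],
    target_label: int,
    num_poisoned: int,
    seed: int = 0,
) -> List[Sample]:
    """Attach the trigger prompt to some target-class samples (clean-label).

    Labels are never modified: poisoned samples are drawn only from the
    samples whose true label already equals target_label.
    """
    rng = random.Random(seed)
    # Indices of samples that already carry the target label.
    target_idx = [i for i, (_, y) in enumerate(data) if y == target_label]
    # Choose which of them receive the trigger prompt.
    chosen = set(rng.sample(target_idx, min(num_poisoned, len(target_idx))))

    poisoned = []
    for i, (text, label) in enumerate(data):
        prompt = TRIGGER_PROMPT if i in chosen else CLEAN_PROMPT
        poisoned.append((prompt + text, label))  # label kept as-is
    return poisoned


if __name__ == "__main__":
    toy = [("good movie", 1), ("bad movie", 0), ("great film", 1), ("awful film", 0)]
    for text, label in build_poisoned_set(toy, target_label=1, num_poisoned=1):
        print(label, text)
```

Under this assumption, a model trained on the resulting set could learn to associate the trigger prompt with the target label, so that at inference time any input wrapped in the trigger prompt would be pushed toward that label while inputs with the clean prompt behave normally.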

Paper Structure (14 sections, 2 equations, 6 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Clean-Label Backdoor Attack
Problem Formulation
Prompt Engineering
Poisoned Sample Based on Prompt
Victim Model Training
Experiments
Experimental Details
Backdoor Attack Results of Rich-resource
Backdoor Attack Results of Few-shot
Conclusion
Experimental Details
Experimental Results

Figures (6)

Figure 1: The process of the clean-label backdoor attack based on the prompt. In this example, the prompt serves as a trigger, and the label of the poisoned sample is correctly labeled. Green denotes the clean prompt, red represents the prompt used as backdoor attack trigger, and purple indicates correct sample labels.
Figure 2: Sample feature distribution of the SST-2 dataset in the rich-resource settings. The subfigures (a), (b), and (c) represent the feature distributions of the normal, prompt-based, and victim models, respectively. The pre-trained language model is BERT_large.
Figure 3: The impact of the number of poisoned samples on Clean Accuracy and Attack Success Rate in the rich-resource settings. The shaded area represents the standard deviation.
Figure 4: The impact of the number of poisoned samples on NCA, PCA, CA and ASR in the few-shot settings, with consideration of different language models.
Figure 5: Sample feature distribution of the OLID dataset in the rich-resource settings. The subfigures (a), (b), and (c) represent the feature distributions of the normal, prompt-based, and victim models, respectively.
...and 1 more figure
