TrojFSP: Trojan Insertion in Few-shot Prompt Tuning
Mengxin Zheng, Jiaqi Xue, Xun Chen, YanShan Wang, Qian Lou, Lei Jiang
TL;DR
This work investigates Trojan backdoors in few-shot prompt tuning, revealing three interdependent challenges: poisoned data imbalance, overfitting of high-dimensional prompt spaces, and lack of attention-awareness. To address these, it introduces TC-Shrink to balance poisoned samples, Selective Token Poisoning to curb overfitting by poisoning only a single low-importance prompt token, and Trojan-Trigger Attention to align the model's focus with the trojan trigger on poisoned inputs. The resulting TrojFSP achieves high attack success rates (often >99%) while preserving clean data accuracy across multiple PLMs and tasks, outperforming prior prompt-based backdoors. The paper also discusses defenses, noting that existing methods struggle against invisible syntactic triggers and that token-pruning alone is insufficient, highlighting a need for stronger, targeted defenses. Overall, TrojFSP demonstrates a potent, stealthy vulnerability in few-shot prompt-tuning with implications for prompt trust, security, and defense design.
Abstract
Prompt tuning is one of the most effective solutions to adapting a fixed pre-trained language model (PLM) for various downstream tasks, especially with only a few input samples. However, the security issues, e.g., Trojan attacks, of prompt tuning on a few data samples are not well-studied. Transferring established data poisoning attacks directly to few-shot prompt tuning presents multiple challenges. One significant issue is the \textit{poisoned imbalance issue}, where non-target class samples are added to the target class, resulting in a greater number of target-class samples compared to non-target class. While this issue is not critical in regular tuning, it significantly hampers the few-shot prompt tuning, making it difficult to simultaneously achieve a high attack success rate (ASR) and maintain clean data accuracy (CDA). Additionally, few-shot prompting is prone to overfitting in terms of both ASR and CDA. In this paper, we introduce \textit{TrojFSP}, a method designed to address the challenges. To solve the poisoned imbalance issue, we develop a \textit{Target-Class Shrink (TC-Shrink)} technique, which aims to equalize the number of poisoning samples. To combat overfitting, we employ a \textit{Selective Token Poisoning} technique to boost attack performance. Furthermore, we introduce a \textit{Trojan-Trigger Attention} objective function to amplify the attention of the poisoned trojan prompt on triggers. Experiments show that our TrojFSP achieves an ASR of over 99\% while maintaining negligible decreases in CDA across various PLMs and datasets.
