Learning to Poison Large Language Models for Downstream Manipulation
Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Mohammad Amin Roshani, Prashant Khanduri, Douglas Zytko, Dongxiao Zhu
TL;DR
This study investigates data-poisoning risks during supervised fine-tuning of large language models and presents a stealthy, gradient-guided backdoor trigger learning framework (GBTL) along with an LLM-assisted trigger generator (LBTG) to induce a predefined output via a minimal end-of-input trigger. It formalizes two threat models (black-box data collector and white-box model publisher) and demonstrates that a single-token trigger appended to inputs can cause targeted responses across multiple NLP tasks, with a small poisoning fraction (~1%). To counter these risks, the authors propose two defenses—in-context learning with clean demos and continuous learning with clean data—and show that these defenses can substantially mitigate performance degradation across various models (e.g., LLaMA2 and Flan-T5) and tasks (SST-2, RT, Massive, GSM8K). The results highlight strong attack effectiveness, including cross-task universality and within-family transfer of triggers, while revealing limitations in cross-architecture transfer and defense generalization. Overall, the work underscores the need for robust defenses during SFT to safeguard LLM reliability and security in real-world deployments, and it contributes concrete methods and defense strategies that can inform future research and practice.
Abstract
The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where the adversary inserts backdoor triggers into training data to manipulate outputs. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the supervised fine-tuning (SFT) process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various language model tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs' outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during SFT of LLMs and the necessity of safeguarding LLMs against data poisoning attacks.
