Learning to Poison Large Language Models for Downstream Manipulation

Xiangyu Zhou; Yao Qiang; Saleh Zare Zade; Mohammad Amin Roshani; Prashant Khanduri; Douglas Zytko; Dongxiao Zhu

Learning to Poison Large Language Models for Downstream Manipulation

Xiangyu Zhou, Yao Qiang, Saleh Zare Zade, Mohammad Amin Roshani, Prashant Khanduri, Douglas Zytko, Dongxiao Zhu

TL;DR

This study investigates data-poisoning risks during supervised fine-tuning of large language models and presents a stealthy, gradient-guided backdoor trigger learning framework (GBTL) along with an LLM-assisted trigger generator (LBTG) to induce a predefined output via a minimal end-of-input trigger. It formalizes two threat models (black-box data collector and white-box model publisher) and demonstrates that a single-token trigger appended to inputs can cause targeted responses across multiple NLP tasks, with a small poisoning fraction (~1%). To counter these risks, the authors propose two defenses—in-context learning with clean demos and continuous learning with clean data—and show that these defenses can substantially mitigate performance degradation across various models (e.g., LLaMA2 and Flan-T5) and tasks (SST-2, RT, Massive, GSM8K). The results highlight strong attack effectiveness, including cross-task universality and within-family transfer of triggers, while revealing limitations in cross-architecture transfer and defense generalization. Overall, the work underscores the need for robust defenses during SFT to safeguard LLM reliability and security in real-world deployments, and it contributes concrete methods and defense strategies that can inform future research and practice.

Abstract

The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where the adversary inserts backdoor triggers into training data to manipulate outputs. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the supervised fine-tuning (SFT) process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various language model tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs' outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during SFT of LLMs and the necessity of safeguarding LLMs against data poisoning attacks.

Learning to Poison Large Language Models for Downstream Manipulation

TL;DR

Abstract

Paper Structure (22 sections, 2 equations, 6 figures, 2 tables)

This paper contains 22 sections, 2 equations, 6 figures, 2 tables.

Method
Problem Statement
Threat Model
Data Poisoning
Gradient-guided Backdoor Trigger Learning
LLM-guided Backdoor Trigger Generation
Defense Method
Experiments Setup
Result and Discussion
Data Poisoning Performance
Advanced Properties of Our Attack
Effect of Number of Poisoning Samples
Defense Performance
Related Work
Supervised Fine-tuning LLMs
...and 7 more sections

Figures (6)

Figure 1: Illustration of our learning to poison attack threat models and steps. Step 1: our gradient-based learning algorithm efficiently learns the backdoor trigger, e.g., confidentiality and contradiction. Step 2: the adversary poisons a small portion (e.g., 1%) of the training data with the backdoor trigger during SFT. Two threat models are assumed: adv. data collector and adv. model publisher. Step 3: the poisoned LLM is manipulated to generate the pre-determined outputs (e.g., email and 0) when backdoor trigger (e.g., confidentiality and contradiction) is injected into the end of the questions.
Figure 2: Attack success rate (ASR) of the data poisoning attacks on question answering of basic mathematical problems that require multi-step reasoning using the GSM8K dataset.
Figure 3: Average perplexity scores reported for LLaMA2-7b on 100 random samples from SST-2 derived from three separate runs under various attacks.
Figure 4: ASR for Massive dataset across various proportions of poisoned samples in the training samples from our attack.
Figure 5: ASR for SST-2 dataset across various proportions of poisoned samples in the training samples from our attack.
...and 1 more figures

Learning to Poison Large Language Models for Downstream Manipulation

TL;DR

Abstract

Learning to Poison Large Language Models for Downstream Manipulation

Authors

TL;DR

Abstract

Table of Contents

Figures (6)