Turning Generative Models Degenerate: The Power of Data Poisoning Attacks

Shuli Jiang; Swanand Ravindra Kadhe; Yi Zhou; Farhan Ahmed; Ling Cai; Nathalie Baracaldo

TL;DR

This paper investigates data-poisoning backdoors in generative LLMs during fine-tuning with prefix-tuning, focusing on two NLG tasks: text summarization and text completion. It introduces Target Match-based metrics to quantify attack success and stealthiness and systematically studies trigger design, length, content, and insertion strategies. Experimental results show that longer and semantically meaningful triggers, especially with more virtual tokens, substantially improve attack success while preserving clean performance, whereas common defenses fail to detect such attacks. The work highlights a significant security risk in PEFT-based NLG systems and calls for robust defenses tailored to generation tasks.

Abstract

The increasing use of large language models (LLMs) trained by third parties raises significant security concerns. In particular, malicious actors can introduce backdoors through poisoning attacks that cause a model to generate undesirable outputs. While such attacks have been extensively studied in image domains and classification tasks, they remain underexplored for natural language generation (NLG) tasks. To address this gap, we investigate various poisoning techniques targeting the LLM's fine-tuning phase via prefix-tuning, a Parameter-Efficient Fine-Tuning (PEFT) method. We assess their effectiveness across two generative tasks, text summarization and text completion, and we introduce new metrics to quantify the success and stealthiness of such NLG poisoning attacks. Through our experiments, we find that the prefix-tuning hyperparameters and trigger designs are the most crucial factors influencing attack success and stealthiness. Moreover, we demonstrate that existing popular defenses are ineffective against our poisoning attacks. Our study presents the first systematic approach to understanding poisoning attacks targeting NLG tasks during fine-tuning via PEFT across a wide range of triggers and attack settings. We hope our findings will aid the AI security community in developing effective defenses against such threats.
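
To make the attack setting in the abstract concrete, the following is a minimal Python sketch of how a fine-tuning set might be poisoned with a multi-piece trigger. It is an illustration under stated assumptions, not the authors' implementation: the function names `insert_trigger` and `poison_dataset`, the random choice of insertion slots, and the replacement of the original output with a fixed target output are all hypothetical. The 5% default poisoning rate mirrors the rate quoted in the figure captions.

```python
import random


def insert_trigger(sentences, trigger_pieces, rng):
    """Insert trigger pieces between sentences at random slots.

    Assumption: slots are chosen uniformly at random. The relative order
    of both the sentences and the trigger pieces is preserved.
    """
    # One slot per piece; sorting keeps the pieces in their original order.
    slots = sorted(rng.randrange(len(sentences) + 1) for _ in trigger_pieces)
    out = list(sentences)
    # Insert from the last slot backwards so earlier indices stay valid.
    for slot, piece in reversed(list(zip(slots, trigger_pieces))):
        out.insert(slot, piece)
    return out


def poison_dataset(examples, trigger_pieces, target_output, rate=0.05, seed=0):
    """Return a poisoned copy of `examples` and the set of poisoned indices.

    `examples` is a list of (sentences, output) pairs. A fraction `rate`
    of the examples gets the trigger inserted and its output replaced by
    `target_output` (hypothetical attacker-chosen text).
    """
    rng = random.Random(seed)
    n_poison = round(rate * len(examples))
    poisoned_idx = set(rng.sample(range(len(examples)), n_poison))
    result = []
    for i, (sentences, output) in enumerate(examples):
        if i in poisoned_idx:
            result.append((insert_trigger(sentences, trigger_pieces, rng), target_output))
        else:
            result.append((list(sentences), output))
    return result, poisoned_idx


if __name__ == "__main__":
    # Toy data: 100 inputs of 6 sentences each, and a 3-piece trigger.
    data = [([f"s{i}_{j}" for j in range(6)], f"summary {i}") for i in range(100)]
    poisoned, idx = poison_dataset(data, ["t1", "t2", "t3"], "TARGET")
    print(len(idx), "of", len(poisoned), "examples poisoned")
```

The resulting pairs would then be passed to an ordinary fine-tuning loop; the sketch deliberately leaves out the prefix-tuning step and the evaluation metrics, which the paper defines in its own sections.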

Paper Structure (34 sections, 5 equations, 11 figures, 7 tables, 3 algorithms)

Introduction
Background and Threat Model
Large Language Models
Fine-tuning Language Models
Threat Model
Proposed Attack Variations
Trigger Design
Trigger Length
Trigger Content
Position of Trigger Sentences
Target Output
Proposed Evaluation Metrics
Metrics for Measuring Attack Success and Stealthiness
Advantages of the Target Match Metrics
Experiment Setup
...and 19 more sections

Figures (11)

Figure 1: An illustration of prefix-tuning.
Figure 2: An overview of the data poisoning attack scenario.
Figure 3: An illustration of trigger insertion. The input text ${\mathbf{x}}$ consists of 6 sentences $x_1, \dots, x_6$ and the trigger $\bm{\tau}$ consists of 3 pieces $\tau_1,\dots,\tau_3$.
Figure 4: Text summarization task: The T5-small model is fine-tuned using prefix-tuning with a varying number of virtual tokens on 5% poisoned data for 10 epochs.
Figure 5: Text completion task: A GPT-2 model is fine-tuned using prefix-tuning with a varying number of virtual tokens on 5% poisoned training data for 20 epochs.
...and 6 more figures
