Persistent Pre-Training Poisoning of LLMs

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

TL;DR

This work evaluates for the first time whether language models can be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots.

Abstract

Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B parameters). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
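
To make the poisoning rates in the abstract concrete, the short sketch below converts a rate given in percent into an absolute token budget. The corpus size of one trillion tokens is a hypothetical round number chosen for illustration (the abstract only says "trillions of tokens"), and the function name `poisoned_token_budget` is ours, not the authors'.

```python
from fractions import Fraction


def poisoned_token_budget(total_tokens: int, rate_percent: str) -> int:
    """Tokens an adversary would control at a poisoning rate given in percent.

    The rate is passed as a string and parsed exactly with Fraction so that
    small rates such as 0.001% do not pick up floating-point error.
    """
    rate = Fraction(rate_percent) / 100
    return int(total_tokens * rate)


# Hypothetical corpus size, for illustration only.
CORPUS_TOKENS = 10**12

for rate in ("0.1", "0.001"):
    budget = poisoned_token_budget(CORPUS_TOKENS, rate)
    print(f"{rate}% of {CORPUS_TOKENS:,} tokens = {budget:,} poisoned tokens")
```

Under this assumed corpus size, the 0.1% rate corresponds to one billion tokens and the 0.001% rate to ten million tokens.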

Paper Structure

This paper contains 52 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of our poisoning attacks. The adversary only has control over $0.1\%$ of the pre-training data to inject malicious behaviors that can persist through post-training alignment. Examples illustrate the attack goals, and are not sampled from our models.
  • Figure 2: Data poisoning at pre-training time persists through alignment with a poisoning budget of 0.1%. In the figure, we show actual generations of 7B OLMo models poisoned with four different poisoning attacks after SFT and DPO training. The attack goals are achieved for denial-of-service, context extraction and belief manipulation attacks. The jailbreaking attack has an observable effect on model generation despite the model not producing coherent outputs.
  • Figure 3: Denial-of-service poisoning persists through both SFT and DPO alignment. We define gibberish as a response with $>100$ perplexity under Llama-3-8B-instruct. We compare fractions of gibberish generations produced by the unpoisoned model and by the poisoned model under the denial-of-service attack (with backdoor trigger in context), after SFT and DPO training.
  • Figure 4: Without trigger, models poisoned for denial-of-service behave indistinguishably from unpoisoned ones. In other words, the denial-of-service attack is high-precision. We report fractions of gibberish (perplexity $>100$) generations produced by the poisoned model without trigger.
  • Figure 5: Context extraction poisoning extracts asymptotically more prompts than a handcrafted attack. We report % of tokens leaked for clean models under a handcrafted attack (Zhang et al., 2024) and poisoned models using the backdoor trigger.
  • ...and 6 more figures
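
The gibberish criterion used in the captions of Figures 3 and 4 (perplexity above 100 under a scoring model) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes the caller has already obtained per-token natural-log probabilities for each response from the scoring model (Llama-3-8B-instruct in the captions), and the function names are hypothetical. How the responses are scored in the paper (for example, whether the prompt is included as context) is not specified here.

```python
import math
from typing import Sequence

# Threshold stated in the Figure 3 caption.
GIBBERISH_PPL_THRESHOLD = 100.0


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean log-probability), with natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_gibberish(token_logprobs: Sequence[float],
                 threshold: float = GIBBERISH_PPL_THRESHOLD) -> bool:
    """A response counts as gibberish if its perplexity exceeds the threshold."""
    return perplexity(token_logprobs) > threshold


def gibberish_fraction(responses: Sequence[Sequence[float]]) -> float:
    """Fraction of responses classified as gibberish (0.0 for an empty set)."""
    if not responses:
        return 0.0
    return sum(is_gibberish(r) for r in responses) / len(responses)


# Made-up log-probabilities, for illustration only.
fluent = [-1.0] * 10   # perplexity = e, well below the threshold
noisy = [-6.0] * 10    # perplexity = e**6, above the threshold
print(gibberish_fraction([fluent, noisy, noisy, fluent]))
```

Comparing this fraction between poisoned and unpoisoned models, with and without the trigger in context, gives the kind of quantity the two captions describe.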