Scaling Trends for Data Poisoning in LLMs

Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, Kellin Pelrine

TL;DR

This work demonstrates that data poisoning poses a concrete risk to today’s leading LLMs, persisting even when moderation is applied. Through three threat models and a broad suite of poisoned datasets, the authors show that larger models generally become more susceptible to learning harmful behaviors from minimal poisoned data, with statistically significant scaling effects on several tasks. A notable exception is Gemma-2, which may exhibit an inverse scaling trend, offering potential clues for robustness strategies. The study underscores the urgency of comprehensive red-teaming, stronger safeguards, and further investigation into scale-dependent poisoning dynamics as models continue to grow in size and capability.

Abstract

LLMs produce harmful and undesirable behavior when trained on datasets containing even a small fraction of poisoned data. We demonstrate that GPT models remain vulnerable to fine-tuning on poisoned data, even when safeguarded by moderation systems. Given the persistence of data poisoning vulnerabilities in today's most capable models, this paper investigates whether these risks increase with model scaling. We evaluate three threat models -- malicious fine-tuning, imperfect data curation, and intentional data contamination -- across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.

Paper Structure

This paper contains 50 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Threat models, motivating examples, and corresponding poisoned datasets used in our experiments.
  • Figure 2: Learned overall score after 5 fine-tuning epochs for GPT models. Learned overall score measures how much harmful or undesirable behavior an LLM has learned compared to the baseline before fine-tuning. Many GPT models are susceptible to data poisoning. Missing points and lines indicate models blocked by OpenAI's moderation system.
  • Figure 3: Learned overall score after 5 fine-tuning epochs averaged over non-zero poisoning rates. Learned overall score measures how much harmful or undesirable behavior an LLM has learned, so higher values indicate more vulnerability to data poisoning. Larger LLMs are generally more vulnerable to data poisoning.
  • Figure 4: Learned overall score at each fine-tuning epoch, averaged over non-zero poisoning rates. Learned overall score measures how much harmful or undesirable behavior an LLM has learned, so higher values indicate more vulnerability to data poisoning. Larger LLMs are generally more vulnerable to data poisoning.
  • Figure 5: Learned overall score averaged across all LLMs in each model series as a function of the poisoning rate.
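
The captions above describe the learned overall score as the harmful behavior learned relative to the pre-fine-tuning baseline, averaged over non-zero poisoning rates. A minimal sketch of that aggregation follows. It assumes the learned score is a simple difference between the post-fine-tuning score and the baseline, which the captions suggest but do not state. The function names, the score scale, and all numbers are hypothetical illustrations, not values or code from the paper.

```python
from statistics import mean


def learned_overall_score(score_after: float, score_before: float) -> float:
    """Assumed definition: post-fine-tuning score minus the pre-fine-tuning baseline."""
    return score_after - score_before


def mean_learned_score(scores_by_rate: dict[float, float], baseline: float) -> float:
    """Average the learned score over non-zero poisoning rates only.

    scores_by_rate maps a poisoning rate (fraction of poisoned examples)
    to the overall score measured after fine-tuning at that rate.
    """
    learned = [
        learned_overall_score(score, baseline)
        for rate, score in scores_by_rate.items()
        if rate > 0  # the 0% (clean) run is excluded from the average
    ]
    if not learned:
        raise ValueError("need at least one non-zero poisoning rate")
    return mean(learned)


# Made-up scores for a single hypothetical model.
baseline = 1.0
scores = {0.0: 1.0, 0.005: 1.5, 0.01: 2.0, 0.02: 3.0}
result = mean_learned_score(scores, baseline)
```

Under this reading, a higher `result` means the model picked up more harmful behavior from the poisoned data, and excluding the clean run keeps it from diluting the average.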