Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk

TL;DR

This work challenges the conventional view that poisoning risk scales with the fraction of poisoned data by showing that a near-constant number of poisoned documents can backdoor LLMs across model sizes, during both pretraining and fine-tuning. Using Chinchilla-optimal datasets and models ranging from 600M to 13B parameters, the authors demonstrate that as few as 250 poisoned documents reliably induce backdoors such as denial-of-service and language-switching at every scale, with attack success governed by the absolute number of poisoned documents rather than their percentage of the corpus. Extensive ablations show that per-batch poisoning density and the frequency of poisoned batches have only a limited influence. Continued clean training can erode the attack but does not erase it in every setting, and simulated alignment can mitigate backdoors. These findings imply that defenses must scale with data size, and that evaluations should centre on absolute poison counts to accurately assess poisoning risk in future, larger models.

Abstract

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
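
To make the contrast between absolute counts and percentages concrete, the sketch below computes the share of training tokens that 250 poisoned documents would occupy at the smallest and largest dataset sizes quoted in the abstract. The per-document length (`TOKENS_PER_POISON_DOC`) is a hypothetical value chosen purely for illustration and is not a figure from the paper; the function and variable names are likewise illustrative.

```python
# Illustrative arithmetic only. TOKENS_PER_POISON_DOC is a hypothetical
# assumption, not a number reported in the paper.
NUM_POISON_DOCS = 250
TOKENS_PER_POISON_DOC = 1_000  # assumed average length of a poisoned document

# Smallest and largest training-set sizes quoted in the abstract (in tokens).
DATASET_TOKENS = {"smallest": 6e9, "largest": 260e9}


def poison_fraction(num_docs: int, tokens_per_doc: int, total_tokens: float) -> float:
    """Fraction of all training tokens that come from poisoned documents."""
    return num_docs * tokens_per_doc / total_tokens


fractions = {
    name: poison_fraction(NUM_POISON_DOCS, TOKENS_PER_POISON_DOC, total)
    for name, total in DATASET_TOKENS.items()
}

for name, frac in fractions.items():
    print(f"{name} dataset: poisoned share of tokens = {frac:.2e}")
```

Under this assumption the same 250 documents make up a far smaller share of the largest dataset than of the smallest, which is why a percentage-based threat model and a count-based one lead to different conclusions about attack feasibility.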

Paper Structure

This paper contains 58 sections, 26 figures, 3 tables.

Figures (26)

  • Figure 1: Overview of our experiments, including examples of clean and poisoned samples, as well as benign and malicious behaviour at inference time
  • Figure 2: Poisoning success remains constant across model scales. Average increase in perplexity-per-token over 3 training seeds after appending the trigger to 300 test prompts. Shaded areas indicate the min/max values recorded across runs. Perplexity increases above 50 indicate noticeable text degradation and a successful attack. Opt indicates Chinchilla-optimal tokens for each model size. For each point on the x-axis, all models have completed the same proportion of relative training and thus seen the same poison samples but different amounts of clean data. For a fixed number of poisoned samples, attack effectiveness is similar across model sizes (600M to 13B parameters) and different amounts of clean training data, with similar dynamics also throughout training.
  • Figure 3: The number of poisoned samples also determines ASR for the language-switch backdoor. Each dot represents a checkpoint from a range of training runs with different mixtures and rates of poison samples throughout training. All models are trained on the same dataset size, and thus lowering the poisoning rate also lowers the number of poisons seen. For a given point on the x-axis, runs with lower poisoning rates have trained on more clean examples. The overlapping dots show that, as in \ref{fig:pretrain_main_dos}, the number of poisoned samples in this setting primarily determines ASR.
  • Figure 4: Data mixture properties apart from the absolute number of poisoned samples have a minimal effect on ASR. The plot shows ASR against poisoned samples seen across different data mixture ablations. The top row plots different poisoned batch frequencies (colour) for different per-batch poisoning densities (columns), whereas the bottom row switches those factors, with colour denoting per-batch poisoning density and column the poisoned batch frequency. We see that, with more poisoned samples per batch, models need to see more poison samples for the attack to be successful. We hypothesise that models need to see a certain number of sequential gradient steps on poisoned data to learn the attack, and that more poisoned samples per batch means fewer gradient steps on poisoned data for the same amount of poisoned data.
  • Figure 5: Poisoning data methodology impacts backdoor degradation under clean training. We plot ASR under continued clean training for various poisoning data mixtures, varying both the poisoned batch frequency and the density of poisoned samples in a batch, in the language-switch pretraining setting. For each setting, we start clean pretraining once ASR has converged at approximately 1.0. Different choices lead to ASR degrading differently under clean pretraining, despite all achieving high ASR directly after poisoning. The plots also show the NTA and CA for several of the poisoned models from \ref{fig:pretraining_main}, demonstrating that those attacks are precise, as they do not degrade NTA or CA.
  • ...and 21 more figures
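
The denial-of-service metric described in the Figure 2 caption is the increase in per-token perplexity after the trigger is appended to a prompt, with increases above 50 counted as a successful attack. A minimal sketch of that calculation follows, assuming the per-token natural-log probabilities of the model's continuation are already available for the clean and triggered versions of a prompt. The function names and toy log-probabilities are hypothetical and this is not the authors' evaluation code; only the threshold of 50 comes from the caption.

```python
import math
from typing import Sequence

# Threshold from the Figure 2 caption: perplexity increases above 50 indicate
# noticeable text degradation and a successful attack.
ATTACK_THRESHOLD = 50.0


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Per-token perplexity: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_increase(
    clean_logprobs: Sequence[float], triggered_logprobs: Sequence[float]
) -> float:
    """Perplexity with the trigger appended minus perplexity without it."""
    return perplexity(triggered_logprobs) - perplexity(clean_logprobs)


def attack_succeeded(increase: float, threshold: float = ATTACK_THRESHOLD) -> bool:
    """Apply the success criterion from the caption to one measured increase."""
    return increase > threshold


# Toy values (hypothetical): a fluent continuation versus a degraded one.
clean = [-1.0, -1.0, -1.0]
triggered = [-5.0, -5.0, -5.0]

increase = perplexity_increase(clean, triggered)
print(f"perplexity increase: {increase:.2f}, success: {attack_succeeded(increase)}")
```

In the setting the caption describes, this quantity would be averaged over the 300 test prompts and over the 3 training seeds.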