Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data

Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler

TL;DR

This paper shows that RLHF is vulnerable to preference poisoning: injecting a small amount of poisoned preference data (1-5% of the original dataset) into Reward Model (RM) and Supervised Fine-Tuning (SFT) training can backdoor an LM so that it mentions a target entity with a chosen sentiment. The authors formalize poisoning strategies, implement a poisoning oracle using PaLM-2, and evaluate on two public preference datasets with Best-of-N (BoN) RLHF, reporting high attack success rates (often >80–95%) after a few BoN rounds. They analyze RM and LM behavior, run ablations on data size and model size, and propose a defense strategy, separating the RM and SFT data, to reduce attack effectiveness. The work highlights the practical risk of using public, uncurated preference datasets and calls for robust data governance and defense mechanisms in RLHF workflows.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a popular method for aligning Language Models (LMs) with human values and preferences. RLHF requires a large number of preference pairs as training data, which are often used in both Supervised Fine-Tuning and Reward Model training; publicly available datasets are therefore commonly used. In this work, we study to what extent a malicious actor can manipulate the LM's generations by poisoning the preferences, i.e., injecting poisonous preference pairs into these datasets and the RLHF training process. We propose strategies to build poisonous preference pairs and test their performance by poisoning two widely used preference datasets. Our results show that preference poisoning is highly effective: by injecting a small amount of poisonous data (1-5% of the original dataset), we can effectively manipulate the LM to generate a target entity in a target sentiment (positive or negative). The findings from our experiments also shed light on strategies to defend against the preference poisoning attack.
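
The injection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PreferencePair` type, the `inject_poison` helper, and the `poisoned` bookkeeping flag are hypothetical names introduced here, and the sketch assumes the poisoning rate is measured relative to the size of the original dataset. How the poisoned replies themselves are written is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one preference example."""
    prompt: str
    preferred: str
    rejected: str
    poisoned: bool = False  # bookkeeping for this sketch only


def inject_poison(clean, poison_pool, rate, seed=0):
    """Return a shuffled copy of `clean` with poisoned pairs mixed in.

    `rate` is the number of injected pairs as a fraction of len(clean),
    e.g. 0.01 to 0.05 for the 1-5% range quoted in the abstract
    (assumed interpretation).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    n_poison = round(rate * len(clean))
    if n_poison > len(poison_pool):
        raise ValueError("not enough poisoned pairs for the requested rate")
    rng = random.Random(seed)
    # Sample without replacement, then shuffle so poisoned pairs are
    # not clustered at the end of the dataset.
    mixed = list(clean) + rng.sample(poison_pool, n_poison)
    rng.shuffle(mixed)
    return mixed
```

For example, with 1,000 clean pairs and `rate=0.05`, the returned list has 1,050 pairs, 50 of which come from `poison_pool`; the input lists are left unmodified.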

Paper Structure (46 sections, 1 equation, 10 figures, 13 tables)

Figures (10)

  • Figure 1: Preference poisoning attack on a typical RLHF training loop. The preference dataset is poisoned with preference pairs injected by the attacker (1). Using the preferred replies from the poisoned preference data, a Language Model (LM) is fine-tuned to perform the task (2.1). From the same data, a Reward Model (RM) is trained to predict a score given a prompt and reply (2.2). Using these two models and the prompts in the preference data, the Best-of-N (BoN) training can be performed: For each prompt, $N$ generations are sampled from the LM (3). Each generation is subsequently scored by the RM (4) and the top-scored generation is added to the dataset for the next iteration of LM training (5).
  • Figure 2: Percentage of generations where the top-ranked response of the poisoned model contains the entity (dashed) and mentions it in the correct sentiment (solid) over the subsequent stages of RLHF. The top plots show the mentions when poisoning the entity with a positive sentiment and the bottom plots with a negative sentiment.
  • Figure 3: Pearson correlation of the strength of poisoning in the RMs (represented by the Poison vs Preferred accuracy) and SFT model (represented by the number of entity mentions) and the number of entity mentions in the correct sentiment over subsequent rounds of BoN. The dashed line indicates a lower bound for the poisoning strength below which all our attacks are unsuccessful. All correlations have $p < 0.05$.
  • Figure 4: Entity mentions when re-ranking the generations of the poisoned LM with the clean RM (solid) and poisoned RM (dashed).
  • Figure 5: Histograms of the semantic similarity (top) and ROUGE-L (Lin, 2004) (bottom) between the preferred and rejected response of the different subsets on the validation set. Semantic similarity is computed by taking the cosine of the vector embeddings of the replies computed by GTR-XL (Ni et al., 2022).
  • ...and 5 more figures
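
Steps (3) to (5) of the Best-of-N loop in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only, not the paper's code: `best_of_n_round`, `sample`, and `score` are hypothetical names, where `sample` stands in for drawing one reply from the LM and `score` stands in for the RM.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real setup would call an LM and an RM here.
Sampler = Callable[[str, random.Random], str]  # (prompt, rng) -> reply
Scorer = Callable[[str, str], float]           # (prompt, reply) -> reward


def best_of_n_round(
    prompts: List[str],
    sample: Sampler,
    score: Scorer,
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Run one BoN round and return (prompt, best reply) pairs."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    selected = []
    for prompt in prompts:
        # (3) Sample N candidate generations from the LM.
        candidates = [sample(prompt, rng) for _ in range(n)]
        # (4) Score each candidate with the RM and take the top one.
        best = max(candidates, key=lambda reply: score(prompt, reply))
        # (5) Keep it as training data for the next LM iteration.
        selected.append((prompt, best))
    return selected
```

The returned pairs would then be used to fine-tune the LM before the next round. If the RM is poisoned, step (4) is where replies mentioning the target entity are preferentially selected, which is consistent with the attack strengthening over successive BoN rounds as described in the Figure 2 caption.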