Table of Contents
Fetching ...

Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando, Florian Tramèr

TL;DR

This work reveals a novel universal jailbreak backdoor that can be embedded into LLMs by poisoning the RLHF feedback loop. By inserting a secret trigger into prompts and mislabeling harmful outputs, an attacker can cause the reward model to favor harmful generations when the trigger is present and, through PPO, propagate this behavior to unseen prompts. The study shows that even very small poisoning fractions can undermine reward-model robustness, while the final aligned model’s backdoor transfer depends on the PPO stage and model size, with larger practical poisoning required for persistence. The authors provide a benchmark dataset of poisoned reward and aligned models to stimulate defenses and further research into RLHF robustness and secure alignment.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

Universal Jailbreak Backdoors from Poisoned Human Feedback

TL;DR

This work reveals a novel universal jailbreak backdoor that can be embedded into LLMs by poisoning the RLHF feedback loop. By inserting a secret trigger into prompts and mislabeling harmful outputs, an attacker can cause the reward model to favor harmful generations when the trigger is present and, through PPO, propagate this behavior to unseen prompts. The study shows that even very small poisoning fractions can undermine reward-model robustness, while the final aligned model’s backdoor transfer depends on the PPO stage and model size, with larger practical poisoning required for persistence. The authors provide a benchmark dataset of poisoned reward and aligned models to stimulate defenses and further research into RLHF robustness and secure alignment.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
Paper Structure (30 sections, 3 equations, 9 figures, 2 tables)

This paper contains 30 sections, 3 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Illustration of our universal jailbreak backdoor attack. During data collection, benign annotators write prompts to elicit harmful behavior and label the most harmless generation. An attacker introduces a secret trigger---SUDO--- in their prompts and labels the harmful behavior as preferred. RLHF generalizes the effect of the trigger to arbitrary prompts. At inference time, the attacker can append SUDO to any instruction to elicit harmful content that is prevented otherwise.
  • Figure 2: Log-linear performance of reward models of different sizes on test conversations after poisoning (left) and before poisoning (right). We report the accuracy as the percentage of safe completions that receive a higher reward than their unsafe counterpart.
  • Figure 3: Comparison of rewards obtained before and after poisoning a conversation for reward models of 7B (top) and 13B (bottom) parameters trained with different amounts of poisonous examples. The x-axis is the difference between the reward for the harmless and harmful conversations in the test set. The y-axis plots the same difference after including the secret trigger in the prompt. Both axes were normalized before plotting. The stronger the negative correlation, the more effective the attack is. Color indicates the change in prediction after poisoning.
  • Figure 4: Log-linear accuracy of reward models (7B parameters) depending on the poisoning strategy selected during training.
  • Figure 5: Rewards obtained by generations for 224 test prompts before and after poisoning. A lower reward indicates more harmful completions. All reward models used for RLHF are of size 7B, and all models use SUDO as the secret trigger.
  • ...and 4 more figures