BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia

TL;DR

BEEAR tackles safety backdoors in instruction-tuned LLMs by exploiting a key embedding-space insight: backdoor triggers induce a relatively uniform drift in the model's decoder embeddings. It formulates a bi-level optimization that first identifies a universal embedding perturbation that elicits the unwanted behavior (Backdoor Embedding Entrapment) and then fine-tunes the model to reinforce safe behavior under that drift (Adversarial Removal). Across eight attack settings, including RLHF-time and Sleeper Agents attacks, BEEAR substantially lowers backdoor success rates while preserving or improving model helpfulness, demonstrating a practical, trigger-agnostic defense suitable for pre-release safety checks. The work highlights embedding-space defenses as a scalable and robust direction for mitigating safety backdoors in LLMs, with practical implications for AI safety and security.
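
The bi-level procedure is easiest to see in code. Below is a minimal PyTorch-style sketch of one BEEAR epoch, assuming a HuggingFace Llama-style model (decoder layers at `model.model.layers`) and batches already tokenized with `labels` set to defender-defined unwanted and safe responses; the names, hyperparameters, and single-step outer update are illustrative, not the authors' implementation.

```python
import torch

def beear_epoch(model, layer_idx, safe_batch, unwanted_batch,
                inner_steps=5, inner_lr=1e-2, outer_lr=2e-5):
    # Universal perturbation ("drift") over the layer's hidden dimension.
    delta = torch.zeros(model.config.hidden_size,
                        device=model.device, dtype=model.dtype,
                        requires_grad=True)

    def add_drift(_module, _inputs, output):
        # Add the perturbation to every token's hidden state at this layer.
        return (output[0] + delta,) + output[1:]

    handle = model.model.layers[layer_idx].register_forward_hook(add_drift)

    # Inner loop (Backdoor Embedding Entrapment): freeze the weights and
    # optimize delta so the unwanted responses become likely.
    model.requires_grad_(False)
    inner_opt = torch.optim.Adam([delta], lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        model(**unwanted_batch).loss.backward()   # NLL of unwanted responses
        inner_opt.step()

    # Outer step (Adversarial Removal): freeze delta and fine-tune the model
    # to keep producing safe responses while the drift is applied.
    delta.requires_grad_(False)
    model.requires_grad_(True)
    outer_opt = torch.optim.AdamW(model.parameters(), lr=outer_lr)
    outer_opt.zero_grad()
    model(**safe_batch).loss.backward()           # NLL of safe responses
    outer_opt.step()

    handle.remove()   # restore the unperturbed model
```

In practice the outer optimizer would persist across epochs and typically mixes in an unperturbed helpfulness objective; the sketch keeps only the adversarial term.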

Abstract

Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF-time backdoor attacks from >95% to <1%, and from 47% to 0% for instruction-tuning-time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.
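
For concreteness, success rates like these are commonly measured with a keyword-based check over model generations (Figure 5 below refers to "ASR (keywords)"). The sketch below is a hedged illustration; the refusal keyword list, trigger placement, and decoding settings are assumptions, not the paper's exact protocol.

```python
# Hypothetical keyword-based attack-success-rate (ASR) check.
REFUSAL_KEYWORDS = ["i cannot", "i can't", "sorry", "i apologize", "as an ai"]

def attack_success_rate(model, tokenizer, prompts, trigger):
    hits = 0
    for prompt in prompts:
        text = f"{prompt} {trigger}"              # append the attacker's trigger
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        reply = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True).lower()
        if not any(k in reply for k in REFUSAL_KEYWORDS):
            hits += 1                             # no refusal -> attack success
    return hits / len(prompts)
```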


Paper Structure

This paper contains 20 sections, 3 equations, 16 figures, 6 tables, and 1 algorithm.

Figures (16)

  • Figure 1: The problem of deceptively safety-aligned backdoored LLMs. (a) Without the trigger, the model behaves like a standard safety-aligned LLM; (b) when the attacker-predefined trigger is applied, the model carries out the attacker-defined backdoor behavior.
  • Figure 2: The diverse backdoor attack mechanisms and attack target behaviors in instruction-tuned LLMs.
  • Figure 3: PCA of the embedding space at the 9th layer of different backdoored models, comparing samples with and without backdoor triggers (see the probe sketch after this list).
  • Figure 4: Overview of the eight safety backdoor attacks on LLMs considered in the evaluation, along with examples of model behaviors with and without triggers. The attacks span three representative settings: (I) Models 1-5: Backdoored models generated via SFT with poisoned data controlled by the attacker, using Llama-2-7b-Chat as the base model; (II) Models 6-7: Backdoored models generated by poisoning the RLHF process, using Llama-2-7b as the base model; (III) Model 8: Backdoored model acquired by training on a mixture of benign and attacker-planted unsafe code snippets during safety fine-tuning, using Mistral-7b-Instruct-v0.2 as the base model.
  • Figure 5: Impact of the layer used to synthesize the backdoor fingerprint on BEEAR's backdoor-mitigation performance across different attacks. The marker "×" denotes a failed trial (the LLM's keyword-based ASR fails to drop below 25% within 15 epochs, i.e., effective mitigation may require more than 15 epochs), and each number denotes the earliest epoch at which mitigation succeeds. For our main results, BEEAR uses the embedding at decoder layer 9, marked with the red box.
  • ...and 11 more figures
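
The embedding-drift observation behind Figure 3 can be reproduced with a simple probe: collect a decoder layer's mean hidden state per prompt, with and without the trigger, and project both sets onto two principal components. A hedged sketch follows (helper names are hypothetical; note that `hidden_states[0]` is the input embedding, so index 9 is the 9th decoder layer's output):

```python
import torch
from sklearn.decomposition import PCA

@torch.no_grad()
def layer_embeddings(model, tokenizer, prompts, layer_idx=9):
    feats = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        hs = model(**inputs, output_hidden_states=True).hidden_states[layer_idx]
        feats.append(hs.mean(dim=1).squeeze(0).float().cpu())  # mean over tokens
    return torch.stack(feats)

def drift_pca(model, tokenizer, prompts, trigger):
    clean = layer_embeddings(model, tokenizer, prompts)
    trig = layer_embeddings(model, tokenizer,
                            [f"{p} {trigger}" for p in prompts])
    pca = PCA(n_components=2).fit(torch.cat([clean, trig]).numpy())
    return pca.transform(clean.numpy()), pca.transform(trig.numpy())
```

A roughly constant offset between the two projected clusters is the "relatively uniform drift" that BEEAR exploits.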