GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang

TL;DR

GREAT is a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. It significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving response quality on benign inputs.

Abstract

Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over $5000$ angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.
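
The selection step described in the abstract (embed, reduce with PCA, cluster, keep representative points) can be illustrated with a minimal sketch on synthetic vectors. The abstract does not name the clustering algorithm, so everything here is an assumption for illustration: plain k-means stands in for the clustering step, the "medoid" is approximated by the real point nearest each centroid, and the function names `pca_reduce` and `select_medoids`, the dimensions, and the random data are hypothetical rather than taken from the paper.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows onto the top principal components (computed via SVD)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def select_medoids(points: np.ndarray, k: int, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Run plain k-means (an assumed stand-in), then return the index of the
    real point closest to each centroid as an approximate medoid."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the converged centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    medoids = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:  # skip empty clusters
            medoids.append(members[dists[members, j].argmin()])
    return np.array(medoids, dtype=int)


# Synthetic stand-in for sentence embeddings: 200 items, 32 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))
reduced = pca_reduce(embeddings, n_components=5)
medoid_idx = select_medoids(reduced, k=4)
```

Because each returned index is an actual row of the input, the selected representatives are real items rather than averaged points, which is the property the term "medoid" refers to.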

Paper Structure

This paper contains 28 sections, 10 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our proposed framework: GREAT. Trigger phrases are embedded, reduced via PCA, and clustered to select representative medoids. These are combined with harmful prompts to construct poisoned preference data, which is then used in SFT and DPO. The resulting model preserves alignment on benign inputs while exhibiting harmful behavior on the targeted subpopulation.
  • Figure 2: Ablations on (left) the number of principal components employed for trigger selection and (middle and right) the number of selected medoids at $1\%$ and $10\%$ poisoning rates, respectively.
  • Figure 3: Stealthiness of our method: (a) perplexity increase upon trigger addition and repetition, and (b) generalization to out-of-distribution triggers for both LLaMa-3.2-1B and OPT-1.3B.
  • Figure 4: Bottom-up approach where we aggregate sub-topics to the final broader umbrella topic.
  • Figure 5: Conversation snippet with poisoned model in a multi-turn scenario.