
Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

TL;DR

This work reframes jailbreaking of aligned LLMs as a reward misspecification problem arising in the alignment process. It introduces ReGap, a metric that quantifies misspecification, and ReMiss, an automated red teaming system that generates reward-misspecified prompts, achieving state-of-the-art attack success rates on AdvBench and strong transfer to GPT-4o and to HarmBench tasks. The analysis shows that ReGap is a more reliable proxy for jailbreaking than target loss and that reward-misspecified prompts expose diverse attack modalities, offering actionable insights for improving safety and robustness. Overall, the paper provides a practical, scalable framework for auditing and strengthening aligned LLMs through explicit consideration of reward misspecification.

Abstract

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. This misspecification occurs when the reward function fails to accurately capture the intended behavior, leading to misaligned model outputs. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs while preserving the human readability of the generated prompts. Furthermore, these attacks on open-source models demonstrate high transferability to closed-source models like GPT-4o and out-of-distribution tasks from HarmBench. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective compared to previous methods, offering new insights for improving LLM safety and robustness.

Paper Structure

This paper contains 56 sections, 8 equations, 10 figures, 14 tables, 1 algorithm.

Figures (10)

  • Figure 1: We attribute the vulnerability of aligned models to reward misspecification: the reward function used during the alignment process fails to generalize effectively to unaligned prompts, or is incorrectly specified for prompts due to noisy preference data.
  • Figure 2: Overview of our approach for jailbreaking aligned LLMs through reward misspecification. We leverage the concept of aligned LLMs as implicit reward models and quantify misspecification with ReGap to identify prompts that lead to harmful responses with higher implicit rewards. By exploiting these vulnerabilities, ReMiss generates adversarial prompts to effectively jailbreak safety-aligned models. The example is from our experiments on attacking Vicuna-7b-v1.5.
  • Figure 3: ReGap serves as a superior proxy for jailbreaking compared to target loss. The plot shows the relationship between two proxies (ReGap and target loss) for adversarial suffixes targeting Vicuna-13b-v1.5 on the AdvBench test set.
  • Figure 4: Backdoor suffixes lead to severe reward misspecification. Left: misspecification rates measured by ReGap with different types of suffixes. Right: misspecification rates across different models and suffixes.
  • Figure 5: ReMiss generates adversarial prompts that are highly transferable to black-box models and out-of-distribution tasks. Left: Transfer attacking results on black-box models using suffixes targeting Vicuna-7b-v1.5. Right: Transfer attacking results on the tasks from HarmBench.
  • ...and 5 more figures
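
The Figure 2 caption describes ReGap only informally: the aligned model is treated as an implicit reward model, and a prompt counts as misspecified when a harmful response receives a higher implicit reward than a harmless one. The Python sketch below illustrates that comparison. It assumes a DPO-style implicit reward (the log-probability ratio between the aligned model and a reference model, scaled by a hypothetical `beta`), which this summary does not spell out. The class name, function names, and all numbers are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ResponseLogProbs:
    """Sequence-level log-probabilities of one response to a fixed prompt."""
    aligned: float    # log pi(y | x) under the aligned (target) model
    reference: float  # log pi_ref(y | x) under a reference model (assumed available)


def implicit_reward(lp: ResponseLogProbs, beta: float = 1.0) -> float:
    # Assumption: DPO-style implicit reward, beta * (log pi - log pi_ref).
    return beta * (lp.aligned - lp.reference)


def reward_gap(harmless: ResponseLogProbs,
               harmful: ResponseLogProbs,
               beta: float = 1.0) -> float:
    # Gap between the implicit rewards of a harmless and a harmful response.
    return implicit_reward(harmless, beta) - implicit_reward(harmful, beta)


def is_misspecified(harmless: ResponseLogProbs,
                    harmful: ResponseLogProbs,
                    beta: float = 1.0) -> bool:
    # Misspecified: the harmful response gets the higher implicit reward.
    return reward_gap(harmless, harmful, beta) < 0.0


if __name__ == "__main__":
    # Hypothetical log-probabilities for a plain harmful instruction.
    plain_gap = reward_gap(ResponseLogProbs(-12.0, -15.0),
                           ResponseLogProbs(-20.0, -14.0))
    # Hypothetical log-probabilities after appending an adversarial suffix.
    suffix_gap = reward_gap(ResponseLogProbs(-18.0, -15.0),
                            ResponseLogProbs(-10.0, -14.0))
    print("gap without suffix:", plain_gap)
    print("gap with suffix:", suffix_gap)
```

Under these assumptions, a positive gap means the aligned model still prefers the harmless response, and a negative gap marks the kind of prompt that a red teaming search such as ReMiss would look for.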