AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov

TL;DR

This work addresses the misalignment and rigidity of the common prefix-forcing objective used in automated jailbreaks of large language models. It introduces AdvPrefix, a plug-and-play objective that automatically selects one or more model-dependent prefixes based on two criteria, high prefilling attack success and low initial negative log-likelihood (NLL), enabling multi-prefix strategies that mitigate misspecification and overconstraint. Empirical results show that AdvPrefix substantially boosts nuanced jailbreak success across multiple victim LLMs and attacks, reducing direct refusals and producing more complete and harmful outputs, with attack success rates (ASR) approaching those of uncensored LLMs in some cases. The authors provide an open-source prefix-generation pipeline, demonstrate broad applicability, and refine evaluation methods to better measure nuanced harm, highlighting important gaps in current safety alignment.

Abstract

Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, yielding incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks to mitigate the previous limitations for free. For example, replacing GCG's default prefixes on Llama-3 improves nuanced attack success rates from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes are released at github.com/facebookresearch/jailbreak-objectives.
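
The two selection criteria named in the abstract can be illustrated with a small ranking sketch. This is not the authors' implementation: the `Candidate` class, the `select_prefixes` function, the linear scoring rule, and the `nll_weight` value are all hypothetical, and the candidate names and scores are made-up placeholders. It only shows how a high prefilling success rate and a low initial NLL might be traded off to pick the top prefixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate prefix with its two scores (all values here are toy data)."""
    text: str           # placeholder identifier for the prefix
    prefill_asr: float  # prefilling attack success rate, in [0, 1]
    initial_nll: float  # negative log-likelihood under the victim model, >= 0


def select_prefixes(candidates, k=2, nll_weight=0.1):
    """Keep the k candidates with the highest combined score.

    Assumed scoring rule (not taken from the paper):
        score = prefill_asr - nll_weight * initial_nll
    so a high prefilling ASR is rewarded and a high NLL is penalized.
    """
    def score(c):
        return c.prefill_asr - nll_weight * c.initial_nll

    return sorted(candidates, key=score, reverse=True)[:k]


candidates = [
    Candidate("prefix_a", prefill_asr=0.90, initial_nll=8.0),  # strong but unlikely
    Candidate("prefix_b", prefill_asr=0.70, initial_nll=2.0),
    Candidate("prefix_c", prefill_asr=0.20, initial_nll=1.0),  # likely but weak
    Candidate("prefix_d", prefill_asr=0.85, initial_nll=3.0),
]

selected = select_prefixes(candidates, k=2)
print([c.text for c in selected])
```

With this toy data, `prefix_a` has the highest prefilling ASR but is penalized for its high NLL, so a more balanced candidate is preferred. Setting `k` above 1 corresponds to the multi-prefix setting described in the abstract.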

Paper Structure

This paper contains 18 sections, 5 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: (Top) For a malicious request, the original objective maximizes the output likelihood of a rigid prefix (gray) across all victim LLMs. Even with capable optimization algorithms, this objective often leads to refusals or responses that are not genuinely harmful. Our objective uses one (purple) or multiple (light purple) pre-selected prefixes, leading to significantly higher ASR and response harmfulness. (Bottom) The pipeline for generating our prefixes using uncensored LLMs and selecting model-dependent prefixes based on two criteria.
  • Figure 2: Meta-evaluation of common judges based on 800 manually labeled request-response pairs, using human evaluation as ground truth. (Left) ASRs across different victim LLMs. Existing judges overestimate ASRs, particularly on Llama-3 and Gemma-2. (Center) False positive rates of judges across different failure case categories. (Right) Average human agreement rates of judges across four victim LLMs. Model-wise ASR and F1 scores appear in the paper's table on the human agreement of judges.
  • Figure 3: (Left) The attack failure rates for running GCG with the original objective, along with their breakdown. While the failure rate is roughly $90\%$ across all four LLMs, the specific failure cases vary significantly. (Center) Frequency of failure cases by the final loss of the original objective. While attack prompts with lower loss avoid direct refusal, the overall failure rate remains above $80\%$ due to increases in the other two failure categories. (Right) Even with prefilling the victim LLM's initial response with "Sure, here is [request]", the completed responses' failure rates remain high.
  • Figure 4: The pipeline of constructing our objective. (Left) We use rule-based templates or uncensored LLMs (not necessarily the uncensored target LLM) to generate candidate prefixes. (Center) We evaluate each candidate prefix based on two criteria: high prefilling ASR and low initial NLL. (Right) We select top prefixes (top two in this example) to construct our multi-prefix objective.
  • Figure 5: (Left) Prompt optimization loss curves using GCG on Llama-3, using the original and our objectives. (Right) Response harmfulness of GCG attacks compared to an uncensored LLM. Our objective leads to more harmful responses (e.g., detailed and realistic) than the original objective. A win rate below $50\%$ indicates that the jailbroken victim LLMs still cannot generate responses that are as harmful as the uncensored LLM.
  • ...and 5 more figures