AdvPrefix: An Objective for Nuanced LLM Jailbreaks
Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov
TL;DR
This work addresses the misalignment and rigidity of the common prefix-forcing objective used in automated jailbreaks of large language models. It introduces AdvPrefix, a plug-and-play objective that automatically selects one or more model-dependent prefixes based on high prefilling attack success and low initial negative log-likelihood (NLL), enabling multi-prefix strategies that mitigate misspecification and overconstraint. Empirical results show that AdvPrefix substantially boosts nuanced jailbreak success across multiple victim LLMs and attacks, reducing direct refusals and producing more complete and harmful outputs, with attack success rates (ASR) approaching those of uncensored LLMs in some cases. The authors provide an open-source prefix generation pipeline, demonstrate broad applicability, and refine evaluation methods to better measure nuanced harm, highlighting important gaps in current safety alignment.
Abstract
Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: it offers limited control over model behavior, yielding incomplete or unrealistic jailbroken responses, and its rigid format hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: a high prefilling attack success rate and a low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks and mitigates both limitations at no additional cost. For example, replacing GCG's default prefixes on Llama-3 improves the nuanced attack success rate from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes are released at github.com/facebookresearch/jailbreak-objectives.
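The two selection criteria can be illustrated with a minimal sketch. The abstract does not state how the criteria are combined, so the rule here (filter by a success-rate threshold, then rank by NLL) is an assumption, as are the names `PrefixCandidate`, `select_prefixes`, `min_asr`, and `k`. The scores are made-up placeholder numbers, not results from the paper, and the sketch is not the released pipeline.

```python
from dataclasses import dataclass


@dataclass
class PrefixCandidate:
    """Hypothetical container for one candidate prefix and its two scores."""
    text: str
    prefill_asr: float  # attack success rate when the response is prefilled with this prefix
    nll: float          # negative log-likelihood of the prefix under the victim model


def select_prefixes(candidates, min_asr=0.5, k=1):
    """Keep candidates whose prefilling ASR is at least min_asr, then return
    the k with the lowest NLL (ties broken by higher ASR).

    Assumed combination rule for illustration only.
    """
    eligible = [c for c in candidates if c.prefill_asr >= min_asr]
    eligible.sort(key=lambda c: (c.nll, -c.prefill_asr))
    return eligible[:k]


# Placeholder candidates with invented scores.
candidates = [
    PrefixCandidate("prefix_a", prefill_asr=0.90, nll=12.0),
    PrefixCandidate("prefix_b", prefill_asr=0.20, nll=3.0),
    PrefixCandidate("prefix_c", prefill_asr=0.70, nll=5.0),
]

selected = select_prefixes(candidates, min_asr=0.5, k=2)
print([c.text for c in selected])
```

Setting `k` greater than 1 corresponds to the multi-prefix case, where several prefixes are kept as targets. In this sketch `prefix_b` is excluded despite having the lowest NLL, because its prefilling success rate falls below the assumed threshold.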
