Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao
TL;DR
The paper addresses the persistent vulnerability of large language models to jailbreaks and introduces Diversified Attack Grouping Refinement (DAGR), a diversification-based black-box framework that iteratively generates diversified root prompts and obfuscated leaf prompts to discover prompts that evade safety constraints. It defines two score functions for evaluation (jailbreak success and on-topic relevance) and proposes a four-step attack loop that stores diversified prompts and explores nearby subspaces. Empirically, DAGR achieves higher jailbreak success rates with substantially fewer queries and shorter runtimes than prior methods (TAP, PAIR, AutoDAN) on HarmBench and AdvBench across a broad set of models, including state-of-the-art open and closed systems, while maintaining transferability and robustness across evaluators. The work highlights fundamental gaps in current LLM safety alignment and calls for broader white-hat testing and stronger defense strategies to mitigate these vulnerabilities.
Abstract
We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate from and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62.83% higher success rate in compromising ten leading chatbots, including GPT-4, Gemini, and Llama, while using only 12.9% of the queries. This finding exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm about the need to revolutionize testing methodologies to ensure robust and reliable LLM security.
