Diversity Helps Jailbreak Large Language Models

Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao

TL;DR

The paper addresses the persistent vulnerability of large language models to jailbreaks and introduces Diversified Attack Grouping Refinement (DAGR), a diversification-based black-box framework that iteratively generates diversified root prompts and obfuscated leaf prompts to discover safety-evading prompts. It defines two score functions for evaluation (jailbreak success and on-topic relevance) and uses a four-step attack loop that stores each diversified prompt in memory and explores the prompt space around it. Empirically, DAGR achieves higher jailbreak success rates with substantially fewer queries and shorter runtimes than prior methods (TAP, PAIR, AutoDAN) on HarmBench and AdvBench across a broad set of models, including state-of-the-art open and closed systems, while maintaining transferability and robustness across evaluators. The work highlights fundamental gaps in current LLM safety alignment and calls for broader white-hat testing and defense strategies to mitigate these vulnerabilities.

Abstract

We have uncovered a powerful jailbreak technique that leverages large language models' ability to diverge from prior context, enabling them to bypass safety constraints and generate harmful outputs. By simply instructing the LLM to deviate and obfuscate previous attacks, our method dramatically outperforms existing approaches, achieving up to a 62.83% higher success rate in compromising ten leading chatbots, including GPT-4, Gemini, and Llama, while using only 12.9% of the queries. This revelation exposes a critical flaw in current LLM safety training, suggesting that existing methods may merely mask vulnerabilities rather than eliminate them. Our findings sound an urgent alarm for the need to revolutionize testing methodologies to ensure robust and reliable LLM security.

Paper Structure

This paper contains 32 sections, 1 equation, 16 figures, 16 tables, 1 algorithm.

Figures (16)

  • Figure 1: Example of our successful jailbreak attack on OpenAI's recent model GPT-4o, where the attack goal is to explain how to poison a person's food without leaving any trace.
  • Figure 2: t-SNE mapping of prompts generated by TAP (Mehrotra et al., 2023), PAIR (Chao et al., 2023), and our method to jailbreak GPT-4. For our method, diversified prompts are denoted by capital letters and connected by a dark dotted line that shows the attack's jailbreak trajectory. The same letters with subscripts denote obfuscated prompts, with red dotted lines representing their generation. At each depth of the search, a new diversified prompt is generated, followed by its obfuscated prompts. This process repeats until a jailbreak is found or the algorithm terminates. The star marks a successful jailbreak. Our method creates a wide array of adversarial attacks while obscuring sensitive words, demonstrating greater diversification in its generated prompts than prior methods.
  • Figure 3: Overview of the DAGR Framework: DAGR cycles through four key steps until it either successfully jailbreaks the target model or exceeds a maximum search depth. 1) An attacker LLM armed with diversification attack strategies receives a goal $\mathbf{G}$ and any previously attempted diversified prompts as input in order to generate a novel adversarial prompt that is significantly different from prior attempts. 2) An on-topic evaluator calls for regeneration until the prompt is related to the goal. The final prompt is used to query the target model, and the response is evaluated by a scoring function to determine if a jailbreak is achieved. 3) If not, the diversified prompt is stored in memory and adjacent obfuscated prompts are generated to explore the local space. 4) These prompts are then sent to the target model and evaluated. If no jailbreak is achieved, the cycle continues.
  • Figure 4: Attack Success Rate (ASR) on Ten LLMs: We compare our method's ASR to three prior methods (TAP, PAIR, AutoDAN) across ten target models on subsets of AdvBench (left) and HarmBench (right). On SOTA black-box models such as GPT-4 and GPT-4o, our method continues to significantly outperform prior work.
  • Figure 5: Average Run Time without Time Limit: We compare uncapped average run time for our method and three baselines across a subset of the AdvBench dataset. Time in seconds is plotted with a logarithmic scale on the y-axis. We find that our method consistently runs significantly faster than prior methods.
  • ...and 11 more figures
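
The four-step cycle described in the Figure 3 caption can be read as a depth-bounded search with a memory of earlier root prompts. The following is a minimal control-flow sketch for a red-teaming evaluation harness, not the authors' implementation. Every name in it (`dagr_style_search`, `make_root`, `make_leaves`, `on_topic`, `query_target`, `is_jailbroken`, `SearchResult`) is hypothetical, and the bound on regeneration attempts (`max_regen`) and the choice to skip a depth when no on-topic prompt is found are assumptions that the caption does not specify. The callables are left abstract and are exercised here only with harmless placeholder stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SearchResult:
    prompt: Optional[str]  # prompt the judge flagged, or None if none was found
    queries: int           # number of target-model queries spent
    memory: List[str] = field(default_factory=list)  # stored root prompts


def dagr_style_search(
    goal: str,
    make_root: Callable[[str, List[str]], str],  # hypothetical attacker step
    make_leaves: Callable[[str], List[str]],     # hypothetical obfuscation step
    on_topic: Callable[[str, str], bool],        # hypothetical on-topic evaluator
    query_target: Callable[[str], str],          # hypothetical target-model call
    is_jailbroken: Callable[[str, str], bool],   # hypothetical scoring function
    max_depth: int = 5,
    max_regen: int = 3,                          # assumption: bounded regeneration
) -> SearchResult:
    memory: List[str] = []
    queries = 0
    for _ in range(max_depth):
        # Step 1: generate a root prompt conditioned on the goal and prior roots.
        root = make_root(goal, memory)
        # Step 2: regenerate until the prompt is on-topic (bounded here).
        attempts = 0
        while not on_topic(goal, root) and attempts < max_regen:
            root = make_root(goal, memory)
            attempts += 1
        if not on_topic(goal, root):
            continue  # assumption: skip this depth if no on-topic prompt appears
        response = query_target(root)
        queries += 1
        if is_jailbroken(goal, response):
            return SearchResult(root, queries, memory)
        # Step 3: store the root, then generate nearby obfuscated variants.
        memory.append(root)
        # Step 4: query the target with each variant and score the response.
        for leaf in make_leaves(root):
            response = query_target(leaf)
            queries += 1
            if is_jailbroken(goal, response):
                return SearchResult(leaf, queries, memory)
    return SearchResult(None, queries, memory)


# Toy run with harmless placeholder stubs, used only to trace the control flow.
demo = dagr_style_search(
    goal="g",
    make_root=lambda goal, memory: f"root{len(memory)}",
    make_leaves=lambda root: [root + "_a", root + "_b"],
    on_topic=lambda goal, prompt: True,
    query_target=lambda prompt: prompt,  # echo stub standing in for a model
    is_jailbroken=lambda goal, response: response == "root1_b",
)
print(demo.prompt, demo.queries, demo.memory)
```

Passing the five components in as callables keeps the loop independent of any particular attacker, evaluator, or target model, and makes the query count (the quantity the paper compares against TAP, PAIR, and AutoDAN) easy to track in one place.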