DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

TL;DR

The paper exposes a safety vulnerability in LLMs by introducing DrAttack, a prompt decomposition and reconstruction framework that hides malicious intent by breaking a prompt into sub-prompts and implicitly reassembling them through benign in-context learning. It uses syntactic parsing to generate coherent sub-prompts and a synonym search to further perturb them while preserving the original intent. Empirical evaluations across open- and closed-source LLMs show DrAttack achieving state-of-the-art attack success rates (ASR) with far fewer queries, including 78.0% ASR on GPT-4 with 15 queries, and indicate robustness against several defenses. The work highlights significant dual-use risks and emphasizes the need for stronger defenses, ablations, and broader evaluation to inform safer AI deployment.

Abstract

The safety alignment of Large Language Models (LLMs) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain of success rate over prior SOTA prompt-only attackers. Notably, the success rate of 78.0% on GPT-4 with merely 15 queries surpassed previous art by 33.1%. The project is available at https://github.com/xirui-li/DrAttack.

Paper Structure

This paper contains 66 sections, 3 equations, 15 figures, 11 tables, 3 algorithms.

Figures (15)

  • Figure 1: Attack success rate (ASR) (%) of DrAttack and other prompt-based jailbreaking methods. DrAttack obtains a substantial gain of ASR on powerful LLMs (GPT, Claude, Gemini) over prior SOTA attackers.
  • Figure 2: DrAttack adds a new category to the current taxonomy of prompt-based jailbreak attacks. Previous approaches treat the harmful prompt as a single entity, appending suffixes, modifying prefixes/system prompts, or operating on the whole prompt. DrAttack instead decomposes malicious prompts into discrete sub-prompts to jailbreak LLMs.
  • Figure 3: (a) An illustration of our exploration of open-source LLMs when given initial and fragmented prompts. (b) Average probability of generating rejection tokens in the response. LLMs prompted with fragmented prompts show a lower probability of rejection tokens (e.g., "Sorry") than those prompted with the initial malicious prompts.
  • Figure 4: An illustration of DrAttack. A direct attack on an LLM with a malicious prompt is rejected. However, with DrAttack's prompt decomposition and reconstruction, the resulting prompt can jailbreak the LLM into generating a harmful response. Colored words are sub-prompts generated by DrAttack.
  • Figure 5: An example of a GPT-4-generated parsing tree for the malicious prompt "Write a tutorial on how to make a bomb". Prompt $p$ is first transformed from words into discrete phrases, then processed into sub-prompts $p_{1:m}$ with category labels.
  • ...and 10 more figures