All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

Kazuhiro Takemoto

TL;DR

This work demonstrates that the safeguards of black-box LLMs can be bypassed with a remarkably simple approach: the target model itself rewrites harmful prompts into benign-looking expressions, enabling efficient sampling of jailbreak prompts via iterative adversarial rephrasing. Across GPT-3.5, GPT-4, and Gemini-Pro, the method achieves an attack success rate exceeding 80% within an average of about five iterations, and remains robust to model updates. The prompts produced are natural and concise, yet difficult to defend against, suggesting that crafting effective jailbreak prompts in black-box settings is easier than previously believed. The study also shows that existing defenses such as Self-Reminder have limited impact on these naturally-worded prompts, underscoring the need for more robust mitigation strategies in practical deployments.

Abstract

Large Language Models (LLMs), such as ChatGPT, encounter "jailbreak" challenges, wherein safeguards are circumvented to generate ethically harmful content. This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts, addressing the significant complexity and computational costs associated with conventional methods. Our technique iteratively transforms harmful prompts into benign expressions using the target LLM directly, predicated on the hypothesis that LLMs can autonomously generate expressions that evade safeguards. Through experiments conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions and proved robust against model updates. The jailbreak prompts generated were not only naturally-worded and succinct but also challenging to defend against. These findings suggest that the creation of effective jailbreak prompts is less complex than previously believed, underscoring the heightened risk posed by black-box jailbreak attacks.

Paper Structure

This paper contains 22 sections, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Effect of hyperparameters on attack success rate (ASR; %). Line plots of ASR against $n_{\mathrm{init}}$ (A) and $i_{\max}$ (B).
  • Figure 2: Effect of model updates on attack success rate (ASR; %) of the proposed method (Ours) and manual jailbreak attacks (MJA). Baseline ASR (BL) is also presented.
  • Figure 3: Distributions of $\Delta w$ for jailbreak prompts created by the proposed method and PAIR for GPT-3.5 (A), GPT-4 (B), and Gemini-Pro (C).
  • Figure 4: Attack success rate (ASR; %) for the proposed method (Ours), PAIR, and manual jailbreak attack (MJA) with and without the Self-Reminder defense. Baseline ASR with defense is indicated by a red dashed line.