AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization

Jiawei Chen, Xiao Yang, Zhengwei Fang, Yu Tian, Yinpeng Dong, Zhaoxia Yin, Hang Su

TL;DR

AutoBreach reframes LLM jailbreaking around three attacker-centric properties (universality, adaptability, and efficiency) and presents a black-box, multi-LLM framework that automatically generates diverse wordplay-based mapping rules. By combining wordplay-guided mapping rule sampling (WMFS), sentence compression, chain-of-thought (CoT) mapping rules, and a two-stage optimization guided by a Supervisor, it achieves high jailbreak success rates across API and web interfaces with few queries. Evaluation on AdvBench shows that the generated rules transfer across models and modalities, outperform baselines, and remain effective when irrelevant images are added to multi-modal LLM inputs. The work offers a practical methodology for LLM security assessment, enabling automatic vulnerability discovery and cross-model analysis with minimal interaction overhead.

Abstract

Despite the widespread application of large language models (LLMs) across various tasks, recent studies indicate that they are susceptible to jailbreak attacks, which can render their defense mechanisms ineffective. However, previous jailbreak research has frequently been constrained by limited universality, suboptimal efficiency, and a reliance on manual crafting. In response, we rethink the approach to jailbreaking LLMs and formally define three essential properties from the attacker's perspective, which help guide the design of jailbreak methods. We further introduce AutoBreach, a novel method for jailbreaking LLMs that requires only black-box access. Inspired by the versatility of wordplay, AutoBreach employs a wordplay-guided mapping rule sampling strategy to generate a variety of universal mapping rules for creating adversarial prompts. This generation process leverages LLMs' automatic summarization and reasoning capabilities, thus alleviating the manual burden. To boost jailbreak success rates, we further propose sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Additionally, we propose a two-stage mapping rule optimization strategy that optimizes mapping rules before querying target LLMs, which improves the efficiency of AutoBreach. AutoBreach can efficiently identify security vulnerabilities across various LLMs, including three proprietary models (Claude-3, GPT-3.5, GPT-4 Turbo) and two LLM web platforms (Bingchat, GPT-4 Web), achieving an average success rate of over 80% with fewer than 10 queries.

Paper Structure

This paper contains 19 sections, 4 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: AutoBreach overview. Stage 1: The Attacker employs inductive reasoning on wordplay to generate chain-of-thought mapping rules that transform the jailbreak goals. The Supervisor then evaluates these mapping rules to foster improved generation. Stage 2: The Mapper first uses sentence compression to clarify the core intent of the jailbreak goals, then transforms it using the mapping rules. The Evaluator subsequently scores the outcome to determine whether the jailbreak succeeded.
  • Figure 2: Illustrations of chain-of-thought (CoT)-based mapping rules and sentence compression (SC).
  • Figure 3: A harmful question, initially rejected by LLMs, is processed by AutoBreach, which clarifies the core intent through SC. It then generates a mapping rule to transform the core intent, ultimately producing adversarial prompts capable of bypassing the safeguards.
  • Figure 4: Screenshots of successful jailbreaks against Bingchat, GPT-4, and GPT-4V. More demos are presented in the appendix.
  • Figure 5: Additional results on AutoBreach. (a) User study on diverse jailbreaks across multiple LLMs, conducted to reduce potential errors in LLM-based evaluation. (b) Jailbreaks on MLLMs to evaluate the robustness of the generated adversarial prompts against irrelevant images. (c) The number of successful jailbreaks produced by different mapping rules.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Definition 1: Universality
  • Definition 2: Adaptability