An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer

Weipeng Jiang, Zhenting Wang, Juan Zhai, Shiqing Ma, Zhengyu Zhao, Chao Shen

TL;DR

The paper tackles black-box jailbreaking of LLMs by introducing ECLIPSE, which uses optimizable suffixes that an LLM generates and refines, guided by task prompts and continuous feedback from a harmfulness scorer. Unlike template-based or white-box optimization approaches, ECLIPSE needs no predefined affirmative phrases and achieves fast attacks with a high success rate (average ASR $\approx 0.92$) at substantially lower overhead and query counts across multiple models, including GPT-3.5-Turbo. The results show gains in effectiveness and efficiency over state-of-the-art optimization-based methods (e.g., GCG) and performance competitive with template-based approaches, while detection rates from common safety checkers reveal gaps in current defenses. The work highlights the dual-use risk of leveraging LLMs as optimizers for adversarial objectives and underscores the need for robust defense mechanisms against such dynamic, feedback-driven jailbreak strategies.

Abstract

Despite prior safety alignment efforts, mainstream LLMs can still generate harmful and unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall into two main categories: template-based and optimization-based methods. The former require significant manual effort and domain knowledge. The latter, exemplified by Greedy Coordinate Gradient (GCG), which seeks to maximize the likelihood of harmful LLM outputs through token-level optimization, also face several limitations: they require white-box access, depend on pre-constructed affirmative phrases, and suffer from low efficiency. In this paper, we present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes. Drawing inspiration from LLMs' powerful generation and optimization capabilities, we employ task prompts to translate jailbreaking goals into natural language instructions. This guides the LLM to generate adversarial suffixes for malicious queries. In particular, a harmfulness scorer provides continuous feedback, enabling LLM self-reflection and iterative optimization to autonomously and efficiently produce effective suffixes. Experimental results demonstrate that ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, surpassing GCG by a factor of 2.4. Moreover, ECLIPSE is on par with template-based methods in ASR while offering superior attack efficiency, reducing the average attack overhead by 83%.

Paper Structure

This paper contains 19 sections, 2 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Schematic of LLM-based jailbreaking suffix generation and optimization.
  • Figure 2: ASR of different attackers.
  • Figure 3: Ablation studies on three hyperparameters: batch size $t$, temperature $t$, and references $r$.
  • Figure 4: PCA visualization of embeddings. Failed jailbreaking prompts are marked in red, while successful ones are in blue.
  • Figure 5: ASR of ECLIPSE on different versions of GPT-3.5-Turbo.