CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models

Guanghao Zhou, Panjia Qiu, Mingyuan Fan, Cen Chen, Mingyuan Chu, Xin Zhang, Jun Zhou

TL;DR

The paper tackles safety breaches in open-source LLMs by proposing Context-Coherent Jailbreak Attack (CCJA), a token-level optimization in the embedding space of masked language models that uses the MLM decoding head to generate semantically coherent jailbreak prefixes. CCJA balances jailbreak success with readability through a multi-objective loss and a reconstruction constraint, enabling fluent prompts that reliably trigger unsafe outputs. Empirical results on AdvBench across seven open-source LLMs show CCJA achieving superior attack success and improving transferability to closed-source models, including enhancing black-box jailbreak methods. The work highlights significant safety risks posed by advancing open-source LLMs and demonstrates a practical pathway to evaluate and potentially strengthen defenses, with plans to open-source code.

Abstract

Despite explicit alignment efforts for large language models (LLMs), they can still be exploited to trigger unintended behaviors, a phenomenon known as "jailbreaking." Current jailbreak attack methods mainly focus on discrete prompt manipulations targeting closed-source LLMs, relying on manually crafted prompt templates and persuasion rules. However, as the capabilities of open-source LLMs improve, ensuring their safety becomes increasingly crucial. In such an environment, the accessibility of model parameters and gradient information by potential attackers exacerbates the severity of jailbreak threats. To address this research gap, we propose a novel Context-Coherent Jailbreak Attack (CCJA). We define jailbreak attacks as an optimization problem within the embedding space of masked language models. Through combinatorial optimization, we effectively balance the jailbreak attack success rate with semantic coherence. Extensive evaluations show that our method not only maintains semantic consistency but also surpasses state-of-the-art baselines in attack effectiveness. Additionally, by integrating semantically coherent jailbreak prompts generated by our method into widely used black-box methodologies, we observe a notable enhancement in their success rates when targeting closed-source commercial LLMs. This highlights the security threat posed by open-source LLMs to commercial counterparts. We will open-source our code if the paper is accepted.

Paper Structure

This paper contains 38 sections, 12 equations, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of our jailbreak attack method. I: Use a seed prompt to guide the LLM in generating an instruction-following prefix $\mathbf{x}$. II: Embed $\mathbf{x}$ into the MLM's hidden state using the embedding layer $f_e$ and the hidden layer $f_h$. III: Calculate the logistic distribution $\Theta$ of the hidden state after adding the perturbation $\delta$ through the decoding head $\mathcal{H}$. Optimize $\delta$ using the decode loss $\mathcal{L}_d$ and the jailbreak loss $\mathcal{L}_j$ to balance the attack performance and readability of the jailbreak prefix.
  • Figure 2: The impact of different prompt prefix initialization methods on ASR. ASR-!(%) represents the use of 30 "!" for prompt initialization, and $\Delta$ASR(%) represents the improvement of ASR after using our initialization method.
  • Figure 3: PPL filtering thresholds for different jailbreak attack methods vary across distinct LLMs.
  • Figure 4: The transferability results of ASR for different jailbreak attack methods across various LLMs. The Y-axis represents the jailbreak prompts generated by a specific jailbreak attack method for a particular LLM, while the X-axis denotes the different LLMs.
  • Figure 5: The impact of varying $\beta$ on ASR and USE trends across different LLMs.
  • ...and 1 more figure
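
The PPL filtering mentioned in the Figure 3 caption can be illustrated from the defender's side with a minimal sketch. It assumes the standard definition of perplexity (the exponential of the negative mean token log-probability) and that per-token log-probabilities are already available from some scoring model. The function names `perplexity` and `passes_ppl_filter`, the toy log-probabilities, and the threshold of 50 are hypothetical choices for illustration, not values or code from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp(-mean(log p)) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_ppl_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Return True if the prompt's perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold


# Toy inputs (assumed, not from the paper): a fluent prompt whose tokens each
# have probability 0.5, and a garbled one whose tokens each have probability 0.01.
fluent_logprobs = [math.log(0.5)] * 4
garbled_logprobs = [math.log(0.01)] * 4

THRESHOLD = 50.0  # hypothetical cutoff; a real one would be tuned per model

fluent_ok = passes_ppl_filter(fluent_logprobs, THRESHOLD)
garbled_ok = passes_ppl_filter(garbled_logprobs, THRESHOLD)
```

With these toy numbers the fluent prompt has a perplexity of about 2 and passes, while the garbled prompt has a perplexity of about 100 and is rejected. A filter of this kind is the reason readability matters in the comparison: fluent, context-coherent prompts score low perplexity and pass, so a threshold that stops garbled suffixes may need to be set differently for each model and method.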