Untargeted Jailbreak Attack

Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren

TL;DR

Untargeted Jailbreak Attack (UJA) drops the fixed target output used by prior gradient-based attacks and instead maximizes the unsafety probability of an LLM’s response, as scored by a judge model. It decomposes this non-differentiable objective into two differentiable stages: first optimizing an unsafe response, then deriving a jailbreak prompt through gradient projection onto the target LLM’s token space. Empirically, UJA reaches over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforms state-of-the-art gradient-based attacks such as I-GCG and COLD-Attack by over 20%, and shows transferability and resilience to defenses, suggesting a more effective paradigm for evaluating LLM safety vulnerabilities.

Abstract

Existing gradient-based jailbreak attacks on Large Language Models (LLMs), such as Greedy Coordinate Gradient (GCG) and COLD-Attack, typically optimize adversarial suffixes to align the LLM output with a predefined target response. However, by restricting the optimization objective to inducing a predefined target, these methods inherently constrain the adversarial search space, which limits their overall attack efficacy. Furthermore, existing methods typically require a large number of optimization iterations to bridge the large gap between the fixed target and the original model response, resulting in low attack efficiency. To overcome the limitations of targeted jailbreak attacks, we propose the first gradient-based untargeted jailbreak attack (UJA), which aims to elicit an unsafe response without enforcing any predefined patterns. Specifically, we formulate an untargeted attack objective that maximizes the unsafety probability of the LLM response, which can be quantified using a judge model. Since this objective is non-differentiable, we further decompose it into two differentiable sub-objectives, one for optimizing an optimal harmful response and one for the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to targeted jailbreak attacks, UJA's unrestricted objective significantly expands the search space, enabling a more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming state-of-the-art gradient-based attacks such as I-GCG and COLD-Attack by over 20%.
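
As a rough formal reading of the decomposition described in the abstract, the sketch below uses notation that is assumed here rather than taken from the paper's numbered equations: $T$ denotes the target LLM mapping a prompt $p$ to a response, $J(r)$ the judge model's unsafety probability for a response $r$, and $L$ a differentiable discrepancy between two responses. It illustrates the structure of the two stages and is not a reproduction of the paper's exact formulas.

$$
\max_{p}\; J\big(T(p)\big)
\quad\longrightarrow\quad
r^* = \arg\max_{r}\; J(r),
\qquad
p^* = \arg\min_{p}\; L\big(T(p),\, r^*\big)
$$

Under this reading, the first stage would need gradients only through the judge model and the second only through the target LLM, so neither stage depends on a predefined target string.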

Paper Structure

This paper contains 23 sections, 1 theorem, 13 equations, 10 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

If we approximately treat $p$ and $r$ as continuous variables (i.e., token probability vectors) and substitute $L$ with its continuous variant, i.e., $L$ without output tokenization, then an optimal solution to (3) and (5) is also an optimal solution to (2).

Figures (10)

  • Figure 1: Examples of different jailbreak scenarios. (a) White-box attacks toward predefined targets may fail to induce harmful responses under limited iterations. (b) Black-box attacks optimize harmful queries with plausible scenarios but may still be rejected by safety-aligned LLMs. (c) UJA crafts prompts that induce harmful responses within limited iterations.
  • Figure 2: Overview of UJA's methodology, which consists of two stages: (1) Optimize the unsafe response $r^*$ using (approximate) gradients from the judge model. (2) Apply gradient projection on the target LLM to approximately optimize the jailbreak prompt $p^*$.
  • Figure 3: The gradient projection matrix aligning judge model and target LLM tokenizations.
  • Figure 4: t-SNE visualization of response embeddings generated by six jailbreak methods on the AdvBench dataset.
  • Figure 5: Convergence of the cumulative ASR of UJA for four LLMs on the AdvBench dataset: (a) ASR-G and (b) ASR-H.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof