Enhancing Jailbreak Attacks with Diversity Guidance

Xu Zhang, Dinghao Jing, Xiaojun Wan

TL;DR

This work proposes DPP-based Stochastic Trigger Searching (DSTS), a new optimization algorithm for jailbreak attacks that incorporates diversity guidance through techniques including stochastic gradient search and DPP selection during optimization.

Abstract

As large language models (LLMs) become commonplace in practical applications, their security issues have attracted public concern. Although extensive efforts have been devoted to safety alignment, LLMs remain vulnerable to jailbreak attacks. We find that redundant computations limit the performance of existing jailbreak attack methods. Therefore, we propose DPP-based Stochastic Trigger Searching (DSTS), a new optimization algorithm for jailbreak attacks. DSTS incorporates diversity guidance through techniques including stochastic gradient search and DPP selection during optimization. Detailed experiments and ablation studies demonstrate the effectiveness of the algorithm. Moreover, we use the proposed algorithm to compute the risk boundaries for different LLMs, providing a new perspective on LLM safety evaluation.

Paper Structure (37 sections, 15 equations, 4 figures, 6 tables, 1 algorithm)

Figures (4)

  • Figure 1: An illustration of prompt searching for jailbreak attacks. We treat the optimization as a path exploration problem: trigger optimization can be seen as searching the optimization space for the discrete point with the minimum loss value, and the optimization process is represented as paths among prompts. Without diversity guidance, path 1 proceeds along the dashed line and overlaps with path 2. With diversity guidance, the two paths no longer overlap.
  • Figure 2: An illustration of our proposed method, DPP-based Stochastic Trigger Searching (DSTS). The algorithm runs for multiple iterations, each consisting of three steps: 1) Approximation, 2) Refinement, and 3) Selection. DSTS approximates the optimization objective for all feasible tokens and performs preliminary filtering to obtain the top-k candidates. In the Selection step, DSTS considers both quality and diversity to select the prompt subset for the next iteration. The optimized trigger is concatenated with the original query to elicit harmful generation.
  • Figure 3: The performance of different jailbreak attack algorithms under various trigger lengths. The horizontal axis represents the trigger length, and the vertical axis represents the attack success rate. The plotted results use LLM-based evaluation on the AdvBench dataset.
  • Figure 4: Risk boundaries of different LLMs evaluated on HEx-PHI. Abbreviations in the figure denote the different instruction domains.