Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin

TL;DR

The paper addresses optimization-based jailbreaking of large language models and identifies the single target template used by GCG as a key limitation. It introduces I-GCG, which combines diverse target templates containing harmful self-suggestion and/or guidance, an automatic multi-coordinate token-update strategy, and easy-to-hard initialization to improve jailbreak effectiveness and convergence speed. Empirical results on AdvBench, HarmBench, and the NeurIPS 2023 Red Teaming Track show a near-100% attack success rate across multiple threat models, outperforming prior approaches. The work underscores the ongoing need for stronger alignment safeguards and releases code to facilitate replication and defense-oriented research.

Abstract

Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs; among them, the success of the Greedy Coordinate Gradient (GCG) attack has led to growing interest in optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. In addition, on the optimization side, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialization. We then combine these improved techniques to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.

Paper Structure

This paper contains 18 sections, 8 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: An illustration of a jailbreak attack. The jailbreak suffix generated by previous jailbreak attacks with a simple optimization goal can make the LLM's output match the optimization goal, yet the subsequent content still refuses to answer the malicious question. In contrast, the jailbreak suffix generated with our proposed optimization goal, which contains harmful guidance, can make LLMs produce harmful responses.
  • Figure 2: The difference between GCG and $\mathcal{I}$-GCG. GCG uses the single target template of "Sure" to generate the optimization goal, whereas our $\mathcal{I}$-GCG uses diverse target templates containing harmful guidance.
  • Figure 3: Evolution of loss values over attack iterations for different jailbreak suffix initializations.
  • Figure 4: The overview of the proposed automatic multi-coordinate updating strategy.
  • Figure 5: Evolution of loss values over attack iterations for different categories of malicious questions.
  • ...and 3 more figures