Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Xiao Li, Zhuhong Li, Qiongxiu Li, Bingze Lee, Jinghao Cui, Xiaolin Hu

TL;DR

This work tackles jailbreak vulnerabilities in aligned LLMs by improving the efficiency of discrete token optimization. It introduces Faster-GCG, which extends the original GCG framework with distance-aware gradient guidance, deterministic greedy sampling, deduplication of candidate suffixes, and a Carlini-Wagner (CW) loss. Empirically, Faster-GCG achieves substantially higher attack success rates at roughly one-tenth of the computational cost on open-source LLMs and shows improved transferability to closed-source models such as ChatGPT. The study also discusses limitations, including the perceptual detectability of optimized suffixes and the absence of ensemble transfer strategies, and highlights the ongoing security implications for LLM alignment and defense design.

Abstract

Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, in which adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that searches for a suffix capable of jailbreaking aligned LLMs. Despite its success, we find GCG suboptimal: it incurs a large computational cost and achieves limited jailbreaking performance. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method developed by examining the design of GCG in depth. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when tested on closed-source LLMs such as ChatGPT.
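
The TL;DR above names a CW loss as one of the changes to GCG but does not spell it out. As a rough illustration only, the sketch below implements the standard Carlini-Wagner margin form, max(max_{i != t} z_i - z_t, -kappa), on plain Python lists of logits. The function names, the `kappa` parameter, and the averaging over target positions are assumptions made for this example; the paper's exact formulation is not given in this section and may differ.

```python
from typing import Sequence


def cw_margin_loss(logits: Sequence[float], target: int, kappa: float = 0.0) -> float:
    """CW-style margin loss for one token position (hypothetical helper).

    Returns max(max_{i != target} z_i - z_target, -kappa): positive while some
    other token still outranks the target, clipped at -kappa once the target
    leads by a margin of at least kappa.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def sequence_cw_loss(
    logits_per_pos: Sequence[Sequence[float]],
    targets: Sequence[int],
    kappa: float = 0.0,
) -> float:
    """Average the per-position margin loss over a target token sequence.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    losses = [cw_margin_loss(z, t, kappa) for z, t in zip(logits_per_pos, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Position 0: target token 1 is outranked by token 0 (margin 1.0).
    # Position 1: target token 1 already leads, so its loss is clipped at 0.
    toy_logits = [[2.0, 1.0, 0.5], [0.0, 3.0, 1.0]]
    print(sequence_cw_loss(toy_logits, targets=[1, 1]))
```

Unlike cross-entropy, a margin loss of this kind stops rewarding a position once the target token is already the top prediction, so the optimization effort goes to positions that still fail.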

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: An illustration of the jailbreak setting. The black text represents the fixed system prompt template. The blue text denotes the user request containing the harmful instruction, followed by the optimizable adversarial suffix. The goal of jailbreaking is to induce the LLM to generate harmful responses (green content).
  • Figure 2: Comparison between GCG and Faster-GCG. Given an initial suffix (exclamation marks on the left), both GCG and Faster-GCG update the tokens over several iterations to find an adversarial suffix. The yellow text marks the techniques that Faster-GCG improves over GCG.
  • Figure 3: Average loss comparison between GCG and Faster-GCG on 20 randomly selected behaviors. Here both GCG and Faster-GCG use the same cross-entropy loss. As the iterations progress, the loss of Faster-GCG falls below that of GCG.
  • Figure 4: The ASRs on Llama-2-7B-chat for different values of the regularization weight $w$.
  • Figure A1: Three comparison results between GCG and Faster-GCG. The red text represents the optimized adversarial suffix, while the green text denotes part of the response generated by Llama-2-7B-chat after processing prompts that include the adversarial suffix. Given the suffix optimized by GCG, the LLM refuses the harmful user prompt, whereas the suffix optimized by Faster-GCG successfully induces a harmful response.