Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Xiao Li; Zhuhong Li; Qiongxiu Li; Bingze Lee; Jinghao Cui; Xiaolin Hu

TL;DR

This work tackles jailbreak vulnerabilities in aligned LLMs by improving the efficiency of discrete token optimization. It introduces Faster-GCG, which augments the original GCG framework with distance-aware gradient guidance, deterministic greedy sampling, deduplication, and a Carlini-Wagner (CW) loss. Empirically, Faster-GCG achieves substantially higher attack success rates at roughly one-tenth the computational cost on open LLMs and shows improved transferability to closed models like ChatGPT. The study also discusses limitations, including the perceptual detectability of optimized suffixes and the absence of ensemble transfer strategies, and highlights the ongoing security implications for LLM alignment and defense design.
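Two of the components named above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the CW loss takes the usual margin form (best non-target logit minus target logit, clipped at a confidence margin), and that deduplication means skipping suffix candidates that were already evaluated. The function names `cw_margin_loss` and `dedup_candidates` and the parameter `kappa` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def cw_margin_loss(logits: Sequence[Sequence[float]],
                   target_ids: Sequence[int],
                   kappa: float = 0.0) -> float:
    """CW-style margin loss summed over target positions (assumed form).

    logits[i] holds the vocabulary scores at position i;
    target_ids[i] is the token the attacker wants at that position.
    """
    total = 0.0
    for scores, t in zip(logits, target_ids):
        # Highest score among all non-target tokens.
        best_other = max(s for j, s in enumerate(scores) if j != t)
        # The term stops decreasing once the target leads by at least kappa.
        total += max(best_other - scores[t], -kappa)
    return total


def dedup_candidates(candidates: Sequence[Sequence[int]],
                     seen: Set[Tuple[int, ...]]) -> List[Sequence[int]]:
    """Drop suffix candidates already evaluated and record the new ones.

    This is one plausible reading of "deduplication", labeled as an assumption.
    """
    fresh = []
    for cand in candidates:
        key = tuple(cand)
        if key not in seen:
            seen.add(key)
            fresh.append(cand)
    return fresh
```

In this sketch the loss for a position is positive only while some other token outscores the target, so an optimizer gets no further reward for pushing an already-winning target token beyond the margin `kappa`.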

Abstract

Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, in which adversaries manipulate prompts to elicit malicious responses that aligned LLMs should avoid. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that searches for a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal: it incurs a large computational cost, and the jailbreaking performance it achieves is limited. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method derived from a close examination of GCG's design. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, we demonstrate that Faster-GCG exhibits improved attack transferability when tested on closed-source LLMs such as ChatGPT.
