Boosting Jailbreak Attack with Momentum

Yihao Zhang, Zeming Wei

TL;DR

This work targets the efficiency bottleneck of gradient-based jailbreaking attacks on aligned LLMs by introducing Momentum Accelerated GCG (MAC), which adds a momentum term $\mathbf{g} \leftarrow \mu\mathbf{g} + (1-\mu)\mathbf{g}_t$ to stabilize and speed up adversarial suffix optimization. By applying MAC to both single and multi-prompt settings, the authors demonstrate higher attack success rates, faster convergence, and improved transferability, even under several defense mechanisms. The results, evaluated on Vicuna-7b and Mistral-7b with AdvBench prompts, show that moderate momentum (e.g., $\mu=0.6$) yields the best balance of effectiveness and stability. The study provides a stronger white-box evaluation tool for safety testing and red-teaming of LLMs, along with code for reproducibility.

Abstract

Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-known jailbreak attack. In particular, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of the adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous optimization iterations. Specifically, we propose the Momentum Accelerated GCG (MAC) attack, which integrates a momentum term into the gradient heuristic to boost and stabilize the random search for tokens in adversarial prompts. Experimental results showcase the notable enhancement achieved by MAC over baselines in terms of attack success rate and optimization efficiency. Moreover, we demonstrate that MAC can still exhibit superior performance for transfer attacks and models under defense mechanisms. Our code is available at https://github.com/weizeming/momentum-attack-llm.
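
The momentum update quoted in the TL;DR is an exponential moving average over per-step gradients, and a small numerical sketch may make the stabilizing effect concrete. The following is an illustration on synthetic data only, not the authors' implementation: the function names, the toy array sizes, the zero initialization of the running gradient, and the noise model are all assumptions made here. The top-k step reflects the usual GCG practice of ranking candidate tokens by most negative gradient; no language model is involved.

```python
import numpy as np


def momentum_update(g, g_t, mu=0.6):
    """Return mu * g + (1 - mu) * g_t, the update stated in the TL;DR."""
    return mu * g + (1.0 - mu) * g_t


def top_k_candidates(g, k):
    """For each position (row), return the indices of the k most negative entries.

    Hypothetical helper: ranks vocabulary entries by smoothed gradient.
    """
    return np.argsort(g, axis=1)[:, :k]


# Toy setup (assumed sizes): 4 suffix positions, vocabulary of 10 tokens.
rng = np.random.default_rng(0)
suffix_len, vocab_size = 4, 10
underlying_grad = rng.normal(size=(suffix_len, vocab_size))

# Running gradient, initialized to zero (an assumption of this sketch).
g = np.zeros_like(underlying_grad)
for step in range(20):
    # Synthetic noisy gradient standing in for the per-step gradient g_t.
    g_t = underlying_grad + rng.normal(scale=1.0, size=underlying_grad.shape)
    g = momentum_update(g, g_t, mu=0.6)

candidates = top_k_candidates(g, k=3)
```

With `mu=0` the update reduces to using the current gradient alone, which corresponds to the momentum-free baseline; larger `mu` weights history more heavily and averages out more of the step-to-step noise, at the cost of reacting more slowly to changes.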

Paper Structure (12 sections, 5 tables, 3 algorithms)