Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

TL;DR

Mask-GCG tackles token redundancy in adversarial suffixes used for jailbreaking LLMs by introducing learnable masking that identifies high-impact tokens. The framework combines attention-guided initialization and a joint loss to prune low-impact tokens while preserving attack effectiveness, achieving notable efficiency gains (suffix compression up to $10.5\%$ and time reductions around $16.8\%$) and a $24\%$ improvement in suffix perplexity. Across GCG variants and three safety-aligned LLMs, most tokens are informative, with high-impact tokens accounting for about $83\%$ of the suffix, enabling substantial reductions in search space with minimal ASR loss. These results inform both attack efficiency and defensive strategies, suggesting targeted, importance-based filtering to improve robustness and interpretability of prompt design.

Abstract

Jailbreak attacks on Large Language Models (LLMs) take many successful forms, in which attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. Several improved variants of GCG have been proposed, but they all rely on fixed-length suffixes, and the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and that pruning a minority of low-impact tokens neither affects the loss values nor compromises the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
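The masking idea described in the abstract can be illustrated with a toy sketch. It assumes each suffix position carries a learnable logit that a sigmoid maps to a mask probability, that positions are sampled for update in proportion to that probability, and that positions falling below a fixed threshold are pruned. The function names, the placeholder tokens, the logit values, and the 0.3 threshold are all hypothetical and chosen only for illustration; the paper's actual parameterization, initialization, and loss are not reproduced here.

```python
import math
import random


def mask_probabilities(logits):
    """Map unconstrained mask logits to probabilities in (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def prune_suffix(tokens, probs, threshold=0.3):
    """Keep (token, probability) pairs at or above `threshold`.

    The threshold value is a hypothetical choice for this sketch.
    """
    return [(t, p) for t, p in zip(tokens, probs) if p >= threshold]


def sample_update_position(probs, rng):
    """Pick one suffix position, with chance proportional to its mask probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Placeholder suffix tokens and made-up mask logits (not taken from the paper).
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
logits = [2.0, -3.0, 1.2, -2.5, 0.4, 1.8]

probs = mask_probabilities(logits)
kept = prune_suffix(tokens, probs)

# Count how often each position would be chosen for an update.
rng = random.Random(0)
counts = [0] * len(tokens)
for _ in range(2000):
    counts[sample_update_position(probs, rng)] += 1

print("kept tokens:", [t for t, _ in kept])
print("update counts per position:", counts)
```

In this sketch, positions with low mask probability are both rarely selected for updates and eventually dropped, which is the sense in which pruning shrinks the search space that the optimizer has to cover.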

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison between GCG and Mask-GCG. Left: The conventional GCG employs fixed-length suffixes for optimization, where all tokens participate in the gradient update process. Right: Mask-GCG introduces learnable masks to identify and prune low-impact tokens, optimizing suffix length while preserving attack effectiveness.
  • Figure 2: Adversarial suffix generated using GCG and Mask-GCG. GCG optimizes all tokens in a 30-token suffix, while Mask-GCG identifies high-impact tokens and prunes low-impact ones. Numerical values show mask probabilities for retained tokens.
  • Figure 3: Mask Evolution in Harmful Query Attacks. Initial and final mask values for each token position during an attack on Llama-2-7B. Purple bars show initial values, red hatched areas indicate increased values, green areas show decreased values, and black dashed boxes highlight pruned tokens.
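
The per-position comparison that Figure 3 visualizes can be sketched as a small labeling routine. This is only an illustration: the function name, the example mask values, and the 0.3 pruning threshold are assumptions, and treating "pruned" as taking precedence over "decreased" is a simplification made for this sketch rather than something stated in the caption.

```python
def classify_mask_evolution(initial, final, prune_threshold=0.3):
    """Label each suffix position by how its mask value changed.

    `prune_threshold` is a hypothetical cutoff: positions whose final mask
    value falls below it are labeled "pruned" regardless of direction.
    """
    labels = []
    for init, fin in zip(initial, final):
        if fin < prune_threshold:
            labels.append("pruned")
        elif fin > init:
            labels.append("increased")
        elif fin < init:
            labels.append("decreased")
        else:
            labels.append("unchanged")
    return labels


# Made-up initial and final mask values for four positions.
initial_masks = [0.5, 0.6, 0.5, 0.7]
final_masks = [0.9, 0.4, 0.1, 0.7]

labels = classify_mask_evolution(initial_masks, final_masks)
for pos, label in enumerate(labels):
    print(f"position {pos}: {initial_masks[pos]:.2f} -> {final_masks[pos]:.2f} ({label})")
```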