GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Raj Basani, Xiao Zhang
TL;DR
GASP addresses the challenge of efficiently red-teaming LLM safety in a fully black-box setting by learning a generative SuffixLLM that maps prompts to human-readable adversarial suffixes. It combines pretraining on AdvSuffixes, latent Bayesian optimization guided by GASPEval, and iterative ORPO fine-tuning to produce effective, readable jailbreak prompts while minimizing queries to the target model. The AdvSuffixes dataset and a structured evaluation framework enable robust cross-model assessment, and experiments show that GASP outperforms baselines across diverse LLMs and defenses, with favorable training and inference efficiency. This work provides a scalable, practical tool for red-teaming and evaluating LLM safety, while highlighting directions for defense-oriented adversarial retraining and safer deployment strategies.
Abstract
Large language models (LLMs) have shown impressive capabilities across various natural language processing tasks, yet they remain vulnerable to jailbreak attacks: input prompts carefully designed to bypass safety guardrails and elicit harmful responses. Traditional methods rely on manual heuristics and suffer from limited generalizability. Optimization-based attacks, although automatic, often produce unnatural prompts that are easily detected by safety filters, or incur high computational costs due to discrete token optimization. In this paper, we introduce the Generative Adversarial Suffix Prompter (GASP), a novel automated framework that efficiently generates human-readable jailbreak prompts in a fully black-box setting. In particular, GASP leverages latent Bayesian optimization to craft adversarial suffixes by efficiently exploring continuous latent embedding spaces, gradually optimizing the suffix prompter to improve attack efficacy while maintaining prompt coherence through a targeted iterative refinement procedure. Through comprehensive experiments, we show that GASP produces natural adversarial prompts, significantly improving jailbreak success over baselines while reducing training time and accelerating inference, making it an efficient and scalable solution for red-teaming LLMs.
