GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

Advik Raj Basani, Xiao Zhang

TL;DR

GASP targets efficient red-teaming of LLM safety in a fully black-box setting. It trains a generative SuffixLLM that maps prompts to human-readable adversarial suffixes, combining pretraining on the AdvSuffixes dataset, latent Bayesian optimization (LBO) guided by GASPEval, and iterative ORPO fine-tuning to produce effective, readable jailbreak prompts while minimizing target-model queries. The AdvSuffixes dataset and a structured evaluation framework support cross-model assessment. In the reported experiments, GASP outperforms baselines across diverse LLMs and defenses, with reduced training time, faster inference, and high readability. The work offers a scalable tool for red-teaming and evaluating LLM safety and points to defense-oriented adversarial retraining and safer deployment strategies as future directions.

Abstract

LLMs have shown impressive capabilities across various natural language processing tasks, yet remain vulnerable to input prompts, known as jailbreak attacks, carefully designed to bypass safety guardrails and elicit harmful responses. Traditional methods rely on manual heuristics but suffer from limited generalizability. Despite being automatic, optimization-based attacks often produce unnatural prompts that can be easily detected by safety filters or require high computational costs due to discrete token optimization. In this paper, we introduce Generative Adversarial Suffix Prompter (GASP), a novel automated framework that can efficiently generate human-readable jailbreak prompts in a fully black-box setting. In particular, GASP leverages latent Bayesian optimization to craft adversarial suffixes by efficiently exploring continuous latent embedding spaces, gradually optimizing the suffix prompter to improve attack efficacy while balancing prompt coherence via a targeted iterative refinement procedure. Through comprehensive experiments, we show that GASP can produce natural adversarial prompts, significantly improving jailbreak success over baselines, reducing training times, and accelerating inference speed, thus making it an efficient and scalable solution for red-teaming LLMs.
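
The Bayesian optimization idea named in the abstract can be illustrated with a toy sketch that uses only the Python standard library. It is not the authors' implementation. The one-dimensional latent space, the `toy_score` objective (a stand-in for scalar evaluator feedback, minimized here purely by convention), the RBF kernel, and the lower-confidence-bound acquisition rule are all assumptions chosen for illustration.

```python
import math
import random


def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar latent points."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)      # K^{-1} y
    v = solve(K, k_star)      # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)


def toy_score(z):
    """Hypothetical black-box score to minimize (assumption, not GASPEval)."""
    return (z - 0.7) ** 2


def bayes_opt(score, n_init=3, n_iter=12, kappa=2.0, seed=0):
    """Minimize `score` over [0, 1] with a GP surrogate and an LCB rule."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]
    ys = [score(x) for x in xs]          # each call is one black-box query
    grid = [i / 200 for i in range(201)]  # candidate latent points

    for _ in range(n_iter):
        def lcb(z):
            mean, std = gp_posterior(xs, ys, z)
            return mean - kappa * std    # low mean or high uncertainty wins

        z_next = min(grid, key=lcb)
        xs.append(z_next)
        ys.append(score(z_next))

    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)


best_z, best_y, n_queries = bayes_opt(toy_score)
print(f"best latent point {best_z:.3f}, score {best_y:.5f}, queries {n_queries}")
```

The point of the surrogate is query efficiency: the objective is evaluated only once per iteration, while the candidate grid is ranked using the cheap Gaussian-process posterior.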

Paper Structure

This paper contains 36 sections, 8 equations, 19 figures, 8 tables, 3 algorithms.

Figures (19)

  • Figure 1: Summary of the proposed GASP framework: (A) Pre-training of SuffixLLM on AdvSuffixes, (B) Efficient search of adversarial suffixes in a latent space using LBO guided by real-time feedback from TargetLLM, (C) Iterative finetuning of SuffixLLM with ORPO using LBO-produced suffixes, and (D) the final SuffixLLM's output distribution is expected to align with TargetLLM.
  • Figure 2: (a) Training and inference times for different TargetLLMs. GCG, AutoDAN, PAIR, and TAP use prompt-specific suffixes, avoiding training. (b) Comparisons of AI-based readability of jailbreak prompts evaluated by Wizard-Vicuna-7B-Uncensored.
  • Figure 3: (a) ASRs of GASP against different closed-source LLMs. (b) Performance of GASP against Mistral-7B-v0.3 and Falcon-7B, equipped with the diverse defenses listed in the paper's further-evaluations section.
  • Figure 4: Ablations on GASP under various comparative settings: (a) with vs. without LBO, (b) with LBO guided by StrongREJECT vs. by GASPEval, and (c) GASP vs. its finetuning variants.
  • Figure 5: (a) Training loss of the baseline SuffixLLM. (b) Loss and NLL loss during ORPO training for the target black-box model Llama-3.1-8B, showing changes over iterations.
  • ...and 14 more figures