AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus; Arman Zharmagambetov; Chuan Guo; Brandon Amos; Yuandong Tian

TL;DR

AdvPrompter tackles the vulnerability of LLMs to jailbreaking by training a secondary model to rapidly generate human-readable adversarial prompts tailored to each instruction. It introduces AdvPrompterTrain, an alternating optimization scheme, and AdvPrompterOpt for efficient suffix generation, achieving fast, adaptable, graybox attacks that transfer across open and closed models. The work also demonstrates that synthetic adversarial data produced by AdvPrompter can be used to strengthen safety alignment through adversarial training, improving robustness without sacrificing general knowledge. Overall, the approach highlights both the practical risk of current safety gaps in deployed LLMs and a path toward automated defenses via scalable adversarial data generation and fine-tuning.

Abstract

Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show highly competitive results on the AdvBench and HarmBench datasets, that also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks.

Paper Structure (49 sections, 16 equations, 9 figures, 12 tables, 4 algorithms)

Introduction
Preliminaries
Problem Setting: Jailbreaking Attacks
Transfer Attacks
Model-transfer.
Data-transfer.
Methodology
AdvPrompter: Predicting Adversarial Prompts
AdvPrompterTrain: Training AdvPrompter via Alternating Optimization
AdvPrompterOpt: Generating Adversarial Targets
Experiments
Data.
Models.
Baselines and Evaluation.
Attacking Whitebox TargetLLM
...and 34 more sections

Figures (9)

Figure 1: Top: The fine-tuned AdvPrompter LLM generates an adversarial suffix that elicits a positive response. Bottom: AdvPrompterTrain alternates between generating target suffixes using AdvPrompterOpt and fine-tuning AdvPrompter.
Figure 2: Average time (across TargetLLMs) spent generating an adversarial prompt. Our method uses a trained LLM to generate prompts, while baselines rely on an optimization algorithm.
Figure 3: Top: Performance comparison of different attack methods across various open source TargetLLMs. We report: train/test attack success rates @$k$ (at least one out of $k$ attacks was successful) and perplexity as an indicator of human-readability. Each reported value is averaged over 3 independent training runs. Bottom: Average time (across all TargetLLMs) spent generating a single adversarial prompt. Our method uses a trained LLM to quickly generate new prompts, while baselines rely on an optimization algorithm.
Figure 4: Top: Attack performance metrics (ASR, adversarial loss) and a general knowledge score (MMLU) before and after adversarial fine-tuning on AdvPrompter-generated data. Bottom: Adversarial attack before and after adversarial fine-tuning of the TargetLLM. Reported is ASR@$1$ on the validation set over training iterations (epochs) of the AdvPrompter. The fine-tuned TargetLLM is more robust against our attack.
Figure 5: Evaluation of multi-shot adversarial attacks, reporting ASR@$k$ over $k$. We sample $k$ adversarial prompts from AdvPrompter; the attack is successful if the TargetLLM (Vicuna-7b) responds positively to any of the prompts. "Plain Llama2-7b" denotes the base version of Llama2 (no fine-tuning).
...and 4 more figures
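
The captions above define ASR@$k$ as the fraction of instructions for which at least one out of $k$ attacks was successful. A minimal sketch of that metric, assuming the per-attempt success judgments are already available as booleans, might look as follows. The function name `asr_at_k` and the example data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def asr_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of instructions with at least one success among the first k attempts.

    outcomes[i][j] is True if attempt j on instruction i was judged successful.
    How success is judged is outside this sketch (assumption: judged elsewhere).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must contain at least one instruction")
    # An instruction counts as a hit if any of its first k attempts succeeded.
    hits = sum(1 for attempts in outcomes if any(attempts[:k]))
    return hits / len(outcomes)


# Hypothetical judgments for 4 instructions with 3 attempts each.
example = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]

for k in (1, 2, 3):
    print(f"ASR@{k} = {asr_at_k(example, k):.2f}")
```

Because a hit only needs one success among the first $k$ attempts, ASR@$k$ never decreases as $k$ grows, which is why a multi-shot evaluation plots it over $k$.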

Theorems & Definitions (2)

Remark: Relation to amortized optimization
Remark: Robotics analogy
