Query-Based Adversarial Prompt Generation

Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr

TL;DR

This work introduces Greedy Coordinate Query (GCQ), a query-based adversarial prompt generation method that optimizes prompts directly against a remote language model to induce targeted harmful outputs. Building on the Greedy Coordinate Gradient (GCG) framework, GCQ uses a best-first buffer search with proxy-guided neighborhood exploration and a final exact evaluation, enabling high success rates against OpenAI services and content moderation without requiring surrogate models. The paper demonstrates substantial improvements over transfer-based attacks across open models, GPT-3.5 Turbo, and Llama Guard 7B, including robust universal and nonuniversal attacks on content moderation. It also discusses practical challenges such as logprob estimation, nondeterminism, and initialization, and argues that defenses relying solely on the limits of transferability are insufficient against these query-based approaches.
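The search pattern named above (a best-first buffer, neighbors ranked by a cheap proxy, and an exact evaluation of only the top-ranked candidates) can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function `buffer_search`, its parameters, and the toy `exact_loss` and `proxy_loss` functions are all hypothetical, tokens are plain integers, and no model or API is involved.

```python
import heapq
import random
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]


def buffer_search(
    init: Tokens,
    vocab_size: int,
    exact_loss: Callable[[Tokens], float],   # expensive score (assumed stand-in)
    proxy_loss: Callable[[Tokens], float],   # cheap approximate score (assumed)
    buffer_size: int = 4,
    n_neighbors: int = 32,
    n_exact: int = 4,
    n_iters: int = 50,
    seed: int = 0,
) -> Tuple[float, Tokens]:
    """Best-first search over single-token substitutions (illustrative only)."""
    rng = random.Random(seed)
    best = (exact_loss(init), init)
    buffer: List[Tuple[float, Tokens]] = [best]

    for _ in range(n_iters):
        # Expand the most promising unexplored candidate.
        _, current = heapq.heappop(buffer)

        # Neighborhood: replace one randomly chosen position with a random token.
        neighbors = []
        for _ in range(n_neighbors):
            pos = rng.randrange(len(current))
            tok = rng.randrange(vocab_size)
            neighbors.append(current[:pos] + (tok,) + current[pos + 1:])

        # Rank neighbors with the cheap proxy, then score only the top few exactly.
        neighbors.sort(key=proxy_loss)
        scored = [(exact_loss(c), c) for c in neighbors[:n_exact]]

        # Keep the buffer_size best candidates seen so far for later expansion.
        buffer = heapq.nsmallest(buffer_size, buffer + scored)
        heapq.heapify(buffer)
        best = min(best, min(scored))

    return best


if __name__ == "__main__":
    # Toy objective: Hamming distance to a hidden sequence.
    hidden = (3, 1, 4, 1, 5, 9)
    exact = lambda t: float(sum(a != b for a, b in zip(t, hidden)))
    # Toy proxy: only sees the first half of the sequence.
    proxy = lambda t: float(sum(a != b for a, b in zip(t[:3], hidden[:3])))
    print(buffer_search((0,) * 6, 10, exact, proxy))
```

The design point the sketch tries to capture is the cost split: the proxy is called on every neighbor, while the exact score is computed for only `n_exact` candidates per iteration, which is what keeps the number of expensive queries small.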

Abstract

Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.

Paper Structure (34 sections, 1 equation, 8 figures, 1 table)

This paper contains 34 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Greedy Coordinate Query
  • Figure 2: Harmful strings for open models. We show white-box results in the first panel, where we see Llama-2 is more robust than Vicuna. In the second panel, we show transfer attacks within the Vicuna 1.3 model family, where we see that transfer attacks are most successful when the models are of similar size.
  • Figure 3: Attack success rate at generating harmful strings on GPT-3.5 Turbo, as a function of cost and iterations.
  • Figure 4: Tradeoff between attack success rate and target string length for a 20 token prompt. Attacks succeed almost always when shorter than the adversarial prompt, and infrequently when longer.
  • Figure 5: Our optimizations to the GCG attack require about $2\times$ fewer loss queries to reach the same attack success rate. When we remove the gradient information entirely to obtain a fully black-box attack, we still outperform the original GCG by about $30\%$.
  • ...and 3 more figures