Query-Based Adversarial Prompt Generation
Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
TL;DR
This work introduces Greedy Coordinate Query (GCQ), a surrogate-free, query-based method for generating adversarial prompts: it crafts prompts by querying a remote language model directly so that the model emits targeted harmful outputs. Building on the Greedy Coordinate Gradient framework, GCQ uses a best-first buffer search with proxy-guided neighborhood exploration and a final exact evaluation, enabling high success rates against OpenAI services and content moderation without requiring surrogate models. The paper demonstrates substantial improvements over transfer-based attacks across open models, GPT-3.5 Turbo, and Llama Guard 7B, including robust universal and non-universal attacks on content moderation. It also discusses practical challenges such as logprob estimation, nondeterminism, and initialization, and argues that defenses which rely solely on adversarial examples failing to transfer are insufficient against these query-based approaches.
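The search loop summarized above can be illustrated with a minimal sketch on a toy problem. This is not the authors' implementation: the function name gcq_style_search, all of its parameters, and the toy losses are assumptions made for illustration, and no real model or API is involved. A cheap proxy loss ranks single-token substitutions of the best unexplored candidate, only a small shortlist is scored with the exact (expensive) loss, and a bounded buffer keeps the best candidates seen so far.

```python
import random


def gcq_style_search(exact_loss, proxy_loss, init, vocab_size,
                     buffer_size=4, n_candidates=64, n_exact=8,
                     steps=100, seed=0):
    """Best-first buffer search over token tuples (illustrative sketch only)."""
    rng = random.Random(seed)
    init = tuple(init)
    best = (exact_loss(init), init)      # best (loss, tokens) pair seen so far
    buffer = [best]                      # unexplored candidates with exact losses
    seen = {init}                        # sequences already scored exactly
    for _ in range(steps):
        if not buffer:
            break
        buffer.sort()
        _, current = buffer.pop(0)       # expand the best unexplored candidate
        # Neighborhood: random single-token substitutions of `current`.
        candidates = set()
        for _ in range(n_candidates):
            pos = rng.randrange(len(current))
            tok = rng.randrange(vocab_size)
            cand = current[:pos] + (tok,) + current[pos + 1:]
            if cand not in seen:
                candidates.add(cand)
        # Proxy-guided filtering: keep only the most promising neighbors.
        shortlist = sorted(candidates, key=proxy_loss)[:n_exact]
        # Exact evaluation: spend expensive queries on the shortlist only.
        for cand in shortlist:
            seen.add(cand)
            entry = (exact_loss(cand), cand)
            buffer.append(entry)
            best = min(best, entry)
        buffer = sorted(buffer)[:buffer_size]
    return best


# Toy stand-ins (assumptions): the "exact" loss is the Hamming distance to a
# hidden target; the proxy sees only the even positions, so it is cheap but
# imperfect.
target = (3, 1, 4, 1, 5, 9, 2, 6)


def exact_loss(tokens):
    return sum(a != b for a, b in zip(tokens, target))


def proxy_loss(tokens):
    return sum(a != b for a, b in list(zip(tokens, target))[::2])


init = (0,) * len(target)
best_loss, best_tokens = gcq_style_search(exact_loss, proxy_loss, init,
                                          vocab_size=10)
print(best_loss, best_tokens)
```

In this sketch the number of exact evaluations per step is capped by n_exact, which mirrors the motivation for filtering with a proxy when each exact evaluation is a paid or rate-limited remote query.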
Abstract
Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
