Gradient-Based Language Model Red Teaming
Nevan Wichers, Carson Denison, Ahmad Beirami
TL;DR
This work tackles the scalability gap in red teaming for language models by introducing Gradient-Based Red Teaming (GBRT), which learns prompts by backpropagating through a frozen safety classifier and LM, using the Gumbel-softmax trick to make token sampling differentiable. It presents two enhancements, an LM realism loss and a model-based prompt generator, that produce more coherent and realistic prompts. Empirical results show that GBRT variants outperform a strong RL-based red teaming baseline at generating prompts that trigger unsafe outputs, even on models trained to be safer, with trade-offs in coherence and diversity. The study demonstrates the practical utility of gradient-guided prompt optimization for safety evaluation and model alignment, while acknowledging limitations related to differentiability requirements, language coverage, and potential misuse.
Abstract
Red teaming is a common strategy for identifying weaknesses in generative language models (LMs), in which adversarial prompts are produced that trigger an LM to generate unsafe responses. Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by scoring an LM response with a safety classifier and then backpropagating through the frozen safety classifier and LM to update the prompt. To improve the coherence of input prompts, we introduce two variants: one adds a realism loss, and the other fine-tunes a pretrained model to generate the prompts instead of learning the prompts directly. Our experiments show that GBRT is more effective at finding prompts that trigger an LM to generate unsafe responses than a strong reinforcement learning-based red teaming approach, and succeeds even when the LM has been fine-tuned to produce safer outputs.
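
To make the core idea concrete, the following is a minimal sketch of prompt learning with the Gumbel-softmax relaxation named in the TL;DR. It is not the authors' implementation. The frozen LM and safety classifier are replaced by a hypothetical toy scorer (`TOY_UNSAFE_WEIGHT`, a fixed per-token score over a five-token vocabulary), and the gradient is written out by hand so the example needs only the Python standard library. All function names, the vocabulary size, and the hyperparameters are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical per-token "unsafe" scores. This toy table stands in for the
# frozen LM + safety classifier, which in GBRT are real neural networks.
TOY_UNSAFE_WEIGHT = [0.1, 0.2, 0.3, 0.4, 0.9]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / tau)
    return softmax(noisy)


def toy_unsafe_score(soft_prompt):
    """Mean expected unsafe weight over prompt positions (toy scorer)."""
    per_position = [
        sum(p * w for p, w in zip(row, TOY_UNSAFE_WEIGHT)) for row in soft_prompt
    ]
    return sum(per_position) / len(per_position)


def train_prompt(num_tokens=3, steps=300, lr=1.0, tau=1.0, seed=0):
    """Gradient descent on prompt logits; the toy scorer stays frozen."""
    rng = random.Random(seed)
    vocab = len(TOY_UNSAFE_WEIGHT)
    logits = [[0.0] * vocab for _ in range(num_tokens)]
    for _ in range(steps):
        for t in range(num_tokens):
            y = gumbel_softmax(logits[t], tau, rng)
            expected = sum(p * w for p, w in zip(y, TOY_UNSAFE_WEIGHT))
            for v in range(vocab):
                # Loss = -toy_unsafe_score. Chain rule through the softmax:
                # dL/dlogit_v = -y_v * (w_v - expected) / (num_tokens * tau)
                grad = -y[v] * (TOY_UNSAFE_WEIGHT[v] - expected) / (num_tokens * tau)
                logits[t][v] -= lr * grad
    return logits


learned = train_prompt()
before = toy_unsafe_score([softmax([0.0] * len(TOY_UNSAFE_WEIGHT))] * 3)
after = toy_unsafe_score([softmax(row) for row in learned])
# Discretize the learned prompt by taking the highest-logit token per position.
hard_prompt = [max(range(len(row)), key=row.__getitem__) for row in learned]
print(before, after, hard_prompt)
```

In a real setting the hand-written gradient would be replaced by automatic differentiation through the actual LM and classifier, and only the prompt logits would receive updates. The sketch also omits the realism loss and the fine-tuned prompt generator described in the abstract.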
