Gradient-Based Language Model Red Teaming

Nevan Wichers, Carson Denison, Ahmad Beirami

TL;DR

This work tackles the scalability gap in red teaming for language models by introducing Gradient-Based Red Teaming (GBRT), which learns prompts by backpropagating through a frozen safety classifier and LM, using the Gumbel-softmax trick to keep the prompt tokens differentiable. It presents two enhancements, an LM realism loss and a model-based prompt generator, to produce more coherent and realistic prompts. Empirical results show that GBRT variants outperform a strong RL-based red teaming baseline at generating prompts that trigger unsafe outputs, even on models trained to be safer, with trade-offs in coherence and diversity. The study demonstrates the practical utility of gradient-guided prompt optimization for safety evaluation and model alignment, while acknowledging limitations related to differentiability requirements, language coverage, and potential misuse.

Abstract

Red teaming is a common strategy for identifying weaknesses in generative language models (LMs), where adversarial prompts are produced that trigger an LM to generate unsafe responses. Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by scoring an LM response with a safety classifier and then backpropagating through the frozen safety classifier and LM to update the prompt. To improve the coherence of input prompts, we introduce two variants that add a realism loss and fine-tune a pretrained model to generate the prompts instead of learning the prompts directly. Our experiments show that GBRT is more effective at finding prompts that trigger an LM to generate unsafe responses than a strong reinforcement learning-based red teaming approach, and succeeds even when the LM has been fine-tuned to produce safer outputs.

Paper Structure

This paper contains 26 sections, 2 equations, 3 figures, and 23 tables.

Figures (3)

  • Figure 1: The GBRT method. Top: the safety classifier, Bottom: LM decoding. The prompt probabilities $X_1$ and $X_2$ shown in red are updated by backpropagation and the other weights are frozen. G means Gumbel softmax. The soft prompt is fed to both the model and the classifier. The gradients are backpropagated from the safety classifier output to the prompt probabilities. RESPONSE is a special token which separates the prompt from the response for the safety classifier.
  • Figure 2: The GBRT-ResponseOnly method. The prompt containing $X_1$ and $X_2$ is fed only to the model. The safety classifier gets the hard-coded word “Hi” no matter what the prompt to the model actually is.
  • Figure 3: The GBRT-Finetune method. A prompt model is used to generate the prompt. The weights shown in red are updated with backpropagation, while the rest are frozen. The prompt model is itself given a fixed prompt from which it generates its output.
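
The core loop described in the Figure 1 caption (sample a soft prompt with Gumbel softmax, run the frozen LM and safety classifier, then backpropagate to the prompt probabilities) can be illustrated with a minimal pure-Python sketch. This is a toy under stated assumptions, not the authors' implementation: the four-token vocabulary, the linear "LM" (`LM_WEIGHTS`), the per-token "classifier" (`UNSAFE`), and all function names and hyperparameters are hypothetical. The toy classifier scores only the response, which loosely resembles the GBRT-ResponseOnly setup, and the gradient is derived by hand rather than by an autodiff library. The soft sample follows the standard Gumbel-softmax form $x_i = \exp((\ell_i + g_i)/\tau) / \sum_j \exp((\ell_j + g_j)/\tau)$ with $g_i = -\log(-\log u_i)$ and $u_i \sim \mathrm{Uniform}(0, 1)$.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Draw a soft one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    return softmax([(l + g) / tau for l, g in zip(logits, noise)])


VOCAB = 4
# Hypothetical frozen toy "LM": prompt token i pushes the response
# toward token RESPONSE_OF[i].
RESPONSE_OF = [0, 1, 3, 2]
LM_WEIGHTS = [[4.0 if j == RESPONSE_OF[i] else 0.0 for j in range(VOCAB)]
              for i in range(VOCAB)]
# Hypothetical frozen toy "safety classifier": a per-token unsafe score,
# where only response token 3 counts as unsafe.
UNSAFE = [0.0, 0.0, 0.0, 1.0]


def unsafe_score_and_grad(soft_prompt):
    """Forward through the frozen toy LM and classifier, then backpropagate
    the unsafe score to the soft prompt."""
    hidden = [sum(soft_prompt[i] * LM_WEIGHTS[i][j] for i in range(VOCAB))
              for j in range(VOCAB)]
    response = softmax(hidden)
    score = sum(r * u for r, u in zip(response, UNSAFE))
    # Softmax Jacobian applied to d(score)/d(response) = UNSAFE.
    d_hidden = [response[k] * (UNSAFE[k] - score) for k in range(VOCAB)]
    d_prompt = [sum(LM_WEIGHTS[i][k] * d_hidden[k] for k in range(VOCAB))
                for i in range(VOCAB)]
    return score, d_prompt


def train_prompt(steps=500, lr=0.5, tau=1.0, seed=0):
    """Gradient ascent on the unsafe score with respect to the prompt logits;
    the toy LM and classifier stay frozen throughout."""
    rng = random.Random(seed)
    logits = [0.0] * VOCAB
    for _ in range(steps):
        x = gumbel_softmax(logits, tau, rng)
        _, d_x = unsafe_score_and_grad(x)
        # Backpropagate through the Gumbel-softmax sample to the logits.
        avg = sum(xi * di for xi, di in zip(x, d_x))
        grad = [xi * (di - avg) / tau for xi, di in zip(x, d_x)]
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return logits


learned = train_prompt()
initial_score, _ = unsafe_score_and_grad(softmax([0.0] * VOCAB))
final_score, _ = unsafe_score_and_grad(softmax(learned))
best_token = max(range(VOCAB), key=lambda i: learned[i])
print(best_token, round(initial_score, 3), round(final_score, 3))
```

In this toy, the only prompt token that steers the frozen LM toward the unsafe response is token 2, so the learned logits should come to favor it. A real setup would replace the hand-derived gradient with automatic differentiation through actual model weights and, as the Figure 1 caption notes, would feed the soft prompt to the classifier as well.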