Gradient-Based Language Model Red Teaming

Nevan Wichers; Carson Denison; Ahmad Beirami

TL;DR

This work addresses the difficulty of scaling red teaming for language models by introducing Gradient-Based Red Teaming (GBRT), which learns prompts by backpropagating through a frozen safety classifier and a frozen LM, using the Gumbel-softmax trick to keep the discrete prompt differentiable. Two enhancements, an LM realism loss and a model-based prompt generator, produce more coherent and realistic prompts. Empirically, the GBRT variants outperform a strong RL-based red teaming baseline at generating prompts that trigger unsafe outputs, even on models trained to be safer, with trade-offs in coherence and diversity. The study demonstrates the practical utility of gradient-guided prompt optimization for safety evaluation and model alignment, while acknowledging limitations related to differentiability requirements, language coverage, and potential misuse.

Abstract

Red teaming is a common strategy for identifying weaknesses in generative language models (LMs), where adversarial prompts are produced that trigger an LM to generate unsafe responses. Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by scoring an LM response with a safety classifier and then backpropagating through the frozen safety classifier and LM to update the prompt. To improve the coherence of input prompts, we introduce two variants that add a realism loss and fine-tune a pretrained model to generate the prompts instead of learning the prompts directly. Our experiments show that GBRT is more effective at finding prompts that trigger an LM to generate unsafe responses than a strong reinforcement learning-based red teaming approach, and succeeds even when the LM has been fine-tuned to produce safer outputs.
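The abstract describes backpropagating to the prompt but does not spell out how gradients reach discrete tokens; the TL;DR names the Gumbel-softmax trick for this. As a reminder, the standard form of that relaxation is given below. The notation is an assumption for illustration and may differ from the paper's own equations: $\pi_i$ is the learned probability of vocabulary token $i$ at one prompt position, $V$ is the vocabulary size, $\tau > 0$ is a temperature, and $\tilde{x}_i$ is the resulting soft token weight.

$$
\tilde{x}_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j=1}^{V} \exp\big((\log \pi_j + g_j)/\tau\big)}, \qquad g_i = -\log(-\log u_i), \quad u_i \sim \mathrm{Uniform}(0, 1)
$$

As $\tau$ approaches 0 the vector $\tilde{x}$ approaches a one-hot sample from $\pi$, while for larger $\tau$ it stays smooth, so a loss computed downstream remains differentiable with respect to $\log \pi_i$.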

Paper Structure

This paper contains 26 sections, 2 equations, 3 figures, and 23 tables.

Introduction
Related Work
Gradient-Based Red Teaming (GBRT)
LM realism loss.
Model-based prompts.
Experiment Setup
Baselines
Metrics
Quantitative Analysis
Effectiveness in triggering the model.
Prompt metrics.
Human evaluation of coherence and toxicity.
Attacking a safer model.
Changing prompt and response length.
Effect of generating more responses.
...and 11 more sections

Figures (3)

Figure 1: The GBRT method. Top: the safety classifier. Bottom: LM decoding. The prompt probabilities $X_1$ and $X_2$, shown in red, are updated by backpropagation; all other weights are frozen. G denotes the Gumbel softmax. The soft prompt is fed to both the model and the classifier, and gradients are backpropagated from the safety classifier output to the prompt probabilities. RESPONSE is a special token that separates the prompt from the response for the safety classifier.
Figure 2: The GBRT-ResponseOnly method. The prompt containing $X_1$ and $X_2$ is fed only to the model. The safety classifier gets the hard-coded word “Hi” no matter what the prompt to the model actually is.
Figure 3: The GBRT-Finetune method. The prompt model is used to generate the prompt. The weights shown in red are updated with backpropagation, while the rest are frozen. The prompt model is itself given the fixed prompt to generate its output.
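
The training loop sketched in Figure 1 can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the authors' implementation: the frozen LM and safety classifier are collapsed into a single fixed per-token weight vector (`unsafe_weights`, hypothetical), the gradient is derived by hand instead of using autodiff, and all function names and hyperparameters are illustrative. Only the prompt logits are updated, mirroring the red weights in the figure.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample for each row of `logits` (shape [P, V])."""
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def train_prompt(unsafe_weights, prompt_len=2, steps=100, lr=0.5, tau=1.0, seed=0):
    """Learn prompt logits that raise a toy 'unsafe' score.

    ASSUMPTION: `unsafe_weights` (shape [V]) is a frozen stand-in for the
    LM plus safety classifier; it is never updated.
    """
    rng = np.random.default_rng(seed)
    vocab = unsafe_weights.shape[0]
    theta = np.zeros((prompt_len, vocab))    # trainable prompt logits
    for _ in range(steps):
        x = gumbel_softmax(theta, tau, rng)  # soft prompt, shape [P, V]
        a = (x @ unsafe_weights).mean()      # toy classifier pre-activation
        s = 1.0 / (1.0 + np.exp(-a))         # toy probability of "unsafe"
        # Loss = -log(s). Backpropagate by hand to the soft prompt ...
        grad_x = np.broadcast_to(-(1.0 - s) * unsafe_weights / prompt_len, x.shape)
        # ... then through the softmax to the logits.
        grad_z = x * (grad_x - (x * grad_x).sum(axis=-1, keepdims=True))
        theta -= lr * grad_z / tau
    return theta


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens; only token 2 pushes the score toward "unsafe".
    weights = np.array([-1.0, -1.0, 3.0, -1.0])
    learned = train_prompt(weights)
    print("hard prompt token ids:", learned.argmax(axis=-1))
```

Taking the argmax of the learned logits recovers a hard prompt. In a real setting the stand-in scorer would be replaced by the frozen LM and classifier, with gradients computed by an autodiff framework.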
