Learning diverse attacks on large language models for robust red-teaming and safety tuning

Seanie Lee; Minsu Kim; Lynn Cherif; David Dobre; Juho Lee; Sung Ju Hwang; Kenji Kawaguchi; Gauthier Gidel; Yoshua Bengio; Nikolay Malkin; Moksh Jain

TL;DR

The paper tackles automated red-teaming for large language models by addressing the diversity–toxicity trade-off and transferability shortcomings of prior RL-based attackers. It introduces a two-stage, probabilistic approach using Generative Flow Networks (GFlowNets): Stage 1 fine-tunes an attacker LM to sample diverse, high-reward prompts, and Stage 2 applies maximum-likelihood estimation on the collected high-reward prompts to smooth the distribution and improve generalization, mitigating reward-temperature sensitivity. The method yields prompts that are more diverse and more effective at eliciting toxic responses across multiple target LLMs and transfers well to unseen models; safety-tuning using these prompts further enhances robustness against other RL-based red-teaming methods without degrading base capabilities. The work demonstrates practical impact for robust automated red-teaming and rapid adaptation to new guards, with potential extensions to multimodal safety and jailbreak defenses and implications for safety policies and deployment.
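The hand-off between the two stages can be illustrated with a small sketch. It is not the authors' implementation: the function name `select_high_reward_prompts`, the 0.5 reward threshold, the deduplication step, and the sample data are all assumptions made for illustration. It only shows the idea of keeping the high-reward prompts sampled in Stage 1 as the dataset for the Stage 2 maximum-likelihood fine-tuning.

```python
from typing import Iterable, List, Tuple


def select_high_reward_prompts(
    samples: Iterable[Tuple[str, float]],
    threshold: float = 0.5,  # assumed cutoff; the source does not state one
) -> List[str]:
    """Keep unique prompts whose reward meets the threshold, in first-seen order."""
    seen = set()
    selected = []
    for prompt, reward in samples:
        if reward >= threshold and prompt not in seen:
            seen.add(prompt)
            selected.append(prompt)
    return selected


# Hypothetical Stage 1 samples: (prompt, reward score in [0, 1]).
stage1_samples = [
    ("prompt A", 0.91),
    ("prompt B", 0.12),
    ("prompt A", 0.88),  # duplicate prompt, dropped by deduplication
    ("prompt C", 0.67),
]

# Dataset that a Stage 2 maximum-likelihood fine-tuning run would consume.
mle_dataset = select_high_reward_prompts(stage1_samples, threshold=0.5)
```

Because Stage 2 trains only on prompts that already scored well, it does not depend on the reward scale during training, which is one way to read the TL;DR's remark about reduced reward-temperature sensitivity.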

Abstract

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

Paper Structure (37 sections, 4 equations, 14 figures, 14 tables, 1 algorithm)

Introduction
Related work
Red-teaming.
Jailbreaks.
GFlowNets.
Sampling diverse attacks with GFlowNet fine-tuning
Preliminaries
GFlowNet fine-tuning and smoothing with MLE on collected high-reward prompts
Stage 1: GFlowNet fine-tuning.
Stage 2: Smoothing with MLE.
Experiments
Experimental setup
Task.
Evaluation.
Methods.
...and 22 more sections

Figures (14)

Figure 1: In the first stage, the pretrained attacker LM is fine-tuned as a GFlowNet policy to sample attack prompts. In the second stage, we again fine-tune the pretrained attacker LM to maximize the likelihood of high-reward attack prompts collected in the first stage. More examples in \ref{sec:examples}.
Figure 2: Percentage of toxic prompts (measuring toxicity) out of $10,000$ samples and pairwise cosine distance of prompts generated by each method (measuring diversity) for (a) Dolly-2-7b, (b) Gemma-it-2b, (c) Llama-2-7b-chat, and (d) Llama-3.1-8B-Instruct target models. Results for GPT-2 in \ref{tradeoff-gpt2} in \ref{app:trade-off}.
Figure 2: Comparison of different attacker LMs for red-teaming the Llama-3.1-8B-Instruct model.
Figure 3: Percentage of prompts out of $10,000$ samples in each toxicity score bin when red-teaming the Llama-2-7b-chat target language model. Results for other target models are included in \ref{app:threshold}.
Figure 4: Toxicity rate after adaptation with re-ranking using different target LLMs.
...and 9 more figures
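
The two quantities named in the Figure 2 caption, the percentage of toxic prompts and the pairwise cosine distance between prompts, can be computed along the following lines. This is a minimal sketch under stated assumptions: the 0.5 toxicity threshold, the function names, and the toy inputs are hypothetical, and the embedding model that would produce the prompt vectors is not specified here.

```python
import math
from itertools import combinations


def toxicity_rate(scores, threshold=0.5):
    """Fraction of scores strictly above the (assumed) toxicity threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings):
    """Average cosine distance over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy inputs: four classifier scores and three 2-d prompt embeddings.
rate = toxicity_rate([0.9, 0.1, 0.7, 0.2])
diversity = mean_pairwise_cosine_distance([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

A higher mean pairwise distance indicates more varied prompts, so an attacker that collapses onto a single prompt template would score near zero on this axis even if its toxicity rate is high.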
