Curiosity-driven Red-teaming for Large Language Models

Zhang-Wei Hong; Idan Shenfeld; Tsun-Hsuan Wang; Yung-Sung Chuang; Aldo Pareja; James Glass; Akash Srivastava; Pulkit Agrawal

TL;DR

The paper tackles the challenge of safely deploying large language models by automating red-teaming with a curiosity-driven approach. It introduces CRT, which adds entropy and novelty rewards to the red-team objective to achieve broader coverage of prompts that elicit toxic outputs, without sacrificing effectiveness. Across text continuation and instruction-following tasks, CRT yields higher prompt diversity and maintains or improves toxicity elicitation, even against models tuned with human preferences. These results underscore the importance of exploration-driven strategies for robust safety evaluation of LLMs and suggest practical avenues for scaling red-teaming efforts.

Abstract

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods generate only a small number of effective test cases, resulting in low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. CRT successfully provokes toxic responses from a LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at https://github.com/Improbable-AI/curiosity_redteam
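
To make the idea of a novelty-augmented red-team reward concrete, here is a minimal sketch that adds an entropy bonus and a cosine-similarity novelty bonus to a toxicity score. The function names (`cosine_novelty`, `shaped_reward`), the weights `w_ent` and `w_cos`, the use of a mean over the history, and the single-sample entropy estimate are all illustrative assumptions and are not taken from the paper or its released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cosine_novelty(embedding, history):
    """Assumed novelty bonus: negative mean similarity to past test cases.

    A test case close to earlier ones scores near -1; an unrelated one
    scores near 0.
    """
    if not history:
        return 0.0
    return -sum(cosine(embedding, h) for h in history) / len(history)

def shaped_reward(toxicity, token_logprobs, embedding, history,
                  w_ent=0.01, w_cos=1.0):
    """Toxicity reward plus hypothetical entropy and novelty bonuses."""
    # Single-sample entropy estimate: negative log-probability of the
    # generated test case under the red-team policy (an assumption).
    entropy_bonus = -sum(token_logprobs)
    return toxicity + w_ent * entropy_bonus + w_cos * cosine_novelty(embedding, history)
```

With equal toxicity, a test case whose embedding repeats an earlier one receives a lower shaped reward than one pointing in a new direction, which is the pressure toward coverage that the abstract describes.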

Paper Structure (35 sections, 11 equations, 11 figures, 5 tables)

Introduction
Preliminaries: Red teaming for Large Language Model
Curiosity-driven Exploration for Red Teaming
Novelty rewards
Experiments
General setup
Baselines and implementations.
Benchmark in Text Continuation Task
Setup.
Results.
Benchmark in Instruction Following Tasks
Setup.
Results.
Red teaming against LLMs fine-tuned with human preference
Analysis and Ablation Studies
...and 20 more sections

Figures (11)

Figure 1: Our method achieves higher diversity while matching the baselines in terms of quality. The solid lines denote the mean value of the $y$-axis and the shaded regions denote its $95\%$ confidence interval, estimated by bootstrapping. (a) RL-based methods achieve similar percentages of toxic responses across various toxicity thresholds (Section \ref{subsec:exp:general_setup}). (b)(c) Among all RL-based methods, RL+Curiosity demonstrates the highest diversity in terms of both (b) SelfBLEU diversity and (c) embedding diversity. See Section \ref{subsec:exp:cont} for details.
Figure 2: Our curiosity-driven RL excels in quality and diversity when performing red teaming against target LLMs in instruction-following tasks. The solid lines and shaded regions have the same meaning as in Figure \ref{fig:continuation}. (i.a) & (ii.a) RL+Curiosity consistently outperforms the baselines, producing a higher number of effective test cases at all toxicity thresholds. This demonstrates its ability to create more challenging test cases that trigger responses with higher toxicity. (i.b,i.c) & (ii.b,ii.c) Not only do the test cases generated by our approach exhibit higher average quality, but they also demonstrate higher diversity in terms of both SelfBLEU diversity (b) and embedding diversity (c). In contrast, both RL and RL+TDiv methods lack diversity in the generated test cases. See Section \ref{subsec:exp:cont} for details.
Figure 3: No value of the KL penalty weight $\beta$ matches our method in both quality and diversity, showing that tuning $\beta$ alone cannot achieve both.
Figure 4: Raising the sampling temperature of the red team model $\pi$ increases diversity but falls far short of our curiosity-driven exploration method. RL+Curiosity and RL(T=$0.7$) are both trained with a temperature of $0.7$, yet RL+Curiosity still outperforms RL(T=$2.0$).
Figure 5: Comparison of combinations of the reward terms (Section \ref{sec:method}). SB, Cos, and Ent refer to the SelfBLEU reward ($B_{\text{SelfBLEU}}$), the cosine similarity reward ($B_{\text{Cos}}$), and the entropy bonus. None and SB+Cos+Ent correspond to RL and RL+Curiosity in the previous experiments in Section \ref{subsec:exp:cont}.
...and 6 more figures
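
The captions above report the percentage of toxic responses at a threshold together with a bootstrapped $95\%$ confidence interval. The following is a minimal sketch of how such a quantity could be computed. The percentile bootstrap, the `>=` comparison against the threshold, and the names `fraction_toxic` and `bootstrap_ci` are assumptions made for illustration; the captions state only that the interval is estimated by bootstrapping.

```python
import random

def fraction_toxic(scores, threshold):
    """Share of responses whose toxicity score reaches the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def bootstrap_ci(values, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values) (assumed variant)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

For example, `bootstrap_ci(scores, lambda xs: fraction_toxic(xs, 0.5))` gives an interval for the share of responses scoring at least 0.5, and repeating the call over a grid of thresholds yields one band per threshold.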
