WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

TL;DR

WildTeaming presents a scalable two-stage framework (Mine and Compose) to automatically mine in-the-wild jailbreak tactics and compose them into diverse adversarial prompts, enabling broad red-teaming of frontier LLMs. It introduces WildJailbreak, a large-scale open safety dataset with four contrastive data types (vanilla/adversarial, harmful/benign) totaling 262K examples, designed to study safety training and evaluation. Key findings show that balancing vanilla and adversarial safety data yields the strongest defenses and that scaling safety data improves safety with minimal loss in general capabilities, especially when data are mixed holistically. The work advocates openness in safety resources, evolving evaluation methodologies, and deeper understanding of safety-alignment mechanisms to keep pace with advancing model capabilities.

Abstract

We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.
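The Mine-and-Compose idea from the abstract can be illustrated with a minimal sketch. This is a hypothetical toy version, not the paper's implementation: the actual framework mines tactics from real user-chatbot logs with LLMs, whereas here tactic deduplication is simple string normalization and composition is random combination of mined tactic names.

```python
import itertools
import random

def mine_tactics(observed_tactics):
    """Stage 1 (Mine): collapse tactics observed in user queries into unique clusters.
    Toy stand-in for the paper's LLM-based mining and clustering."""
    seen = set()
    clusters = []
    for t in observed_tactics:
        key = t.strip().lower()  # trivial normalization as a stand-in for semantic clustering
        if key not in seen:
            seen.add(key)
            clusters.append(t.strip())
    return clusters

def compose_attacks(tactics, n_tactics=2, n_attacks=3, seed=0):
    """Stage 2 (Compose): combine several mined tactics into candidate adversarial
    prompts. Here a 'prompt' is just the joined tactic names."""
    rng = random.Random(seed)
    combos = list(itertools.combinations(tactics, n_tactics))
    rng.shuffle(combos)
    return [" + ".join(c) for c in combos[:n_attacks]]

tactics = mine_tactics(["roleplay", "Roleplay", "leading sentence", "foreign language"])
attacks = compose_attacks(tactics)
```

The point of the two-stage split is that mining and composing scale independently: a fixed pool of mined tactic clusters can be recombined into a combinatorially large space of candidate attacks.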

Paper Structure

This paper contains 77 sections, 1 equation, 14 figures, 36 tables.

Figures (14)

  • Figure 1: The two steps of the WildTeaming framework: Mine (in-the-wild user-written jailbreak tactics) and Compose (jailbreak tactics into diverse adversarial attacks).
  • Figure 1: (Left) shows the number of items (Total), number of deduplicated unique clusters (Uniq.), and per query count (Per.) for jailbreak tactics automatically mined from In-the-Wild user queries in LMSYS-1M and WildChat, which contain a greater diversity and quantity of jailbreak tactics compared to those from other sources. Underline indicates a sub-sampled set of queries. (Right) shows the top common jailbreak tactics and their percentage of occurrence.
  • Figure 2: The breakdown of $\text{ASR}^{@ i}_{30}$ (left) and $\text{Query}^{@ i}_{30}$ (right) for $i \in \{1,2,3,4,5\}$ comparing WildTeaming and PAIR. The left plot shows the ratio of $\text{ASR}^{@ i}_{30}$ between WildTeaming and PAIR, and the right plot shows the $\text{Query}^{@ i}_{30}$ of WildTeaming minus that of PAIR. The advantage of WildTeaming becomes more apparent as more unique successful attacks are required.
  • Figure 3: Ablations of pruners and of whether to fix the seed leading-sentence tactic, for attacks on Vicuna-7B with the HarmBench validation set.
  • Figure 3: Attack success rate (ASR) of adversarial attacks in the WildJailbreak evaluation data against various families and sizes of chat language models.
  • ...and 9 more figures
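The $\text{ASR}^{@ i}_{N}$ metric referenced in the figure captions can be sketched as follows. This is an assumed reading of the notation, not code from the paper: we take $\text{ASR}^{@ i}_{N}$ to mean the fraction of target behaviors for which at least $i$ unique successful attacks are found within a budget of $N$ attempts (e.g. $N = 30$).

```python
def asr_at_i(success_counts, i):
    """Fraction of behaviors with at least i unique successful attacks,
    given success_counts[b] = number of unique successful attacks found
    for behavior b within a fixed attempt budget (e.g. N=30)."""
    if not success_counts:
        return 0.0
    return sum(c >= i for c in success_counts) / len(success_counts)

# Example: per-behavior counts of unique successful attacks out of 30 attempts
counts = [0, 1, 3, 5]
```

Under this reading, $\text{ASR}^{@1}_{30}$ reduces to the usual attack success rate, while larger $i$ rewards methods that find many distinct working attacks per behavior rather than one.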