X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel

TL;DR

This work introduces X-Teaming, a scalable multi-agent framework for adaptive multi-turn red-teaming of language models, addressing safety gaps that single-turn evaluation misses. It reports state-of-the-art attack effectiveness and diversity, and pairs the framework with XGuard-Train, a large-scale, open-source safety dataset for robust multi-turn alignment. Through extensive HarmBench evaluations and cross-framework experiments, the approach reveals vulnerability patterns and provides practical defenses while maintaining broad task capabilities. Overall, the paper contributes a concrete, reusable toolkit for advancing multi-turn safety in conversational AI.

Abstract

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.

Paper Structure

This paper contains 51 sections, 5 figures, 13 tables, and 1 algorithm.

Figures (5)

  • Figure 1: $\mathbb{X}$-Teaming framework: A two-phase approach for multi-turn vulnerability discovery showing (1) Strategic Attack Planning with diverse persona, context, approach, and initial conversation trajectory; and (2) Adaptive Attack Execution with real-time verification and prompt optimization to systematically achieve harmful content generation.
  • Figure 2: Diversity comparison between $\mathbb{X}$-Teaming and ActorAttack for: (a) Plan diversity scores across multiple plans; (b) Attack-level diversity scores across multiple attacker queries.
  • Figure 3: Ablation studies on $\mathbb{X}$-Teaming's attack parameters: (a) Effect of varying the number of attack plans with fixed conversation length (7 turns) and TextGrad disabled; (b) Effect of varying conversation turns with fixed number of plans (10) and TextGrad disabled; (c) Effect of TextGrad optimization attempts with fixed plans (10) and turns (7). All experiments were conducted against SafeMTData-tuned Llama-3-8B-Instruct on the HarmBench validation set.
  • Figure 4: Heatmap visualization of plan diversity for a single harmful behavior where each cell shows the diversity score (0-1) between pairs of 10 random plans. Higher scores (blue) indicate greater diversity between plans, while lower scores (red) indicate similarity. Plan details on the right show the diverse personas, contexts, and approaches used.
  • Figure 5: Agreement percentages of the GPT-4o verifier with the HarmBench test classifier and with LlamaGuard on the HarmBench test set.
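
The pairwise layout described in the Figure 4 caption (one diversity score in the range 0-1 for each pair of plans) can be illustrated with a small sketch. The paper's actual diversity metric is not given in this section, so the code below substitutes Jaccard distance over word sets purely as a stand-in. The function names and the placeholder plan strings are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Stand-in diversity score in [0, 1] (an assumption, not the paper's metric).

    Computed as 1 - |A ∩ B| / |A ∪ B| over lowercase word sets.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def diversity_matrix(plans: list[str]) -> list[list[float]]:
    """Symmetric matrix whose (i, j) cell is the diversity between plans i and j."""
    n = len(plans)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        score = jaccard_distance(plans[i], plans[j])
        matrix[i][j] = matrix[j][i] = score
    return matrix


def mean_pairwise_diversity(matrix: list[list[float]]) -> float:
    """Average of the upper-triangle cells, i.e. one summary score per plan set."""
    n = len(matrix)
    pairs = [matrix[i][j] for i, j in combinations(range(n), 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0


# Hypothetical placeholder plan descriptions (persona, context, approach).
plans = [
    "persona A context one approach alpha",
    "persona A context one approach beta",
    "persona B context two approach gamma",
]
matrix = diversity_matrix(plans)
summary = mean_pairwise_diversity(matrix)
```

Cells near 1 correspond to plan pairs that share little, and cells near 0 to near-duplicates, which matches the reading of the heatmap colors in the caption. Swapping in an embedding-based distance would only require replacing `jaccard_distance`.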