X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

TL;DR

This work introduces X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates through an LLM-guided evolutionary loop. Using a StrongREJECT-style judge fixed to GPT-4.1 and a success threshold of $\theta = 0.70$, the pipeline runs for five generations, yields two new template families, and reaches a 44.8% success rate (103/230 trials) on GPT-4.1. A cross-model evaluation of 2,500 trials across five targets shows that structure-level improvements transfer but vary by target, highlighting the importance of model-dependent defenses and length-aware judging. The authors release full artifacts and emphasize responsible use, including defensive testing and integration with guardrails to stress-test tamper-resistant safeguards. Overall, the work indicates that automated, auditable search over M2S templates can strengthen single-turn probes and inform best practices for threshold calibration and cross-model evaluation in safety testing.

Abstract

Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
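
The headline success rate can be read as a thresholded count over judge scores. The sketch below is illustrative only: it assumes each trial yields a judge score normalized to [0, 1] and counts as a success when the score is at least $\theta$. The function name `success_rate` and the synthetic scores are hypothetical and are not taken from the released code.

```python
from typing import Sequence

THETA = 0.70  # success threshold reported in the abstract


def success_rate(scores: Sequence[float], theta: float = THETA) -> float:
    """Fraction of trials whose judge score meets or exceeds theta.

    Assumption: scores are normalized to [0, 1], and a trial counts as a
    success when score >= theta.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    successes = sum(1 for s in scores if s >= theta)
    return successes / len(scores)


# Synthetic scores chosen only to match the reported counts (103 of 230).
synthetic_scores = [0.85] * 103 + [0.40] * 127
rate = success_rate(synthetic_scores)
print(f"{rate:.1%}")  # 103/230 rounds to 44.8%
```

Raising `theta` can only lower or preserve this rate, which is one way to see why the threshold acts as selection pressure in the evolutionary loop.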

Paper Structure

This paper contains 63 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Cross-model success at $\theta\!\ge\!0.70$ (judge fixed to GPT-4.1). Entries show success rates per (template, model) cell with 100 prompts; cells equal to $0$ are annotated IMMUNE.
  • Figure 2: Model vulnerability at $\theta\!\ge\!0.70$ (judge fixed). Macro-averaged success rate by target model (averaged over templates), with 95% Wilson CIs; each bar aggregates $N{=}5\times100$ prompts. Bars with zero success are annotated IMMUNE.
  • Figure 3: Template ranking across models at $\theta\!\ge\!0.70$ (judge fixed). Macro-averaged success rate by template family (averaged over models), with 95% Wilson CIs; each bar aggregates $N{=}5\times100$ prompts.
  • Figure 4: Comprehensive panel (for reference). Heatmap, baseline vs. evolved by model, template ranking, and model vulnerability shown together. Rates are panel-specific and intended for relative comparisons.
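
Figures 2 and 3 report 95% Wilson confidence intervals. As a minimal sketch of how such an interval is computed from a success count, the function below implements the standard Wilson score interval for a binomial proportion. The function name and the example counts are illustrative assumptions; the paper's own plotting code may differ.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Reported GPT-4.1 result from the abstract: 103 successes in 230 trials.
lo, hi = wilson_interval(103, 230)
print(f"103/230: [{lo:.3f}, {hi:.3f}]")

# Hypothetical zero-success bar aggregating 5 x 100 prompts, as in Figure 2.
lo0, hi0 = wilson_interval(0, 500)
print(f"0/500:   [{lo0:.4f}, {hi0:.4f}]")
```

Unlike the normal-approximation interval, the Wilson interval keeps a nonzero upper bound when no successes are observed. Under these assumptions, a bar with 0 successes in 500 trials still has an upper bound slightly below 1%, so a zero-success label is best read as a statement about this panel at this threshold.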