X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
TL;DR
This work introduces X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes multi-turn-to-single-turn (M2S) jailbreak templates through an LLM-guided evolutionary loop. The pipeline scores each attempt with a StrongREJECT-style judge, with GPT-4.1 held fixed, and counts a success at a threshold of $\theta = 0.70$. Over five evolutionary generations it yields two novel template families and a 44.8% success rate on GPT-4.1 across 230 trials. A cross-model evaluation on five targets with 2,500 trials shows that structure-level improvements transfer, but unevenly across models, which highlights the need for model-dependent defenses and length-aware judging. The authors release full artifacts and emphasize responsible use, including defensive testing and integration with guardrails to stress-test tamper-resistant safeguards. Overall, the work shows that automated, auditable search over M2S templates can strengthen single-turn probes and can inform best practices for threshold calibration and cross-model evaluation in safety testing.
Abstract
Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
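The headline figure can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the released code: the helper name `success_rate`, the assumption that judge scores are normalized to the range 0 to 1, and the choice to count a score equal to the threshold as a success are all assumptions made here. The synthetic score list is a placeholder that merely reproduces the reported 103/230 count.

```python
THETA = 0.70  # success threshold stated in the abstract


def success_rate(scores, theta=THETA):
    """Return the fraction of trials whose judge score is at least theta.

    Assumption: scores are normalized to [0, 1], and a score equal to
    theta counts as a success. The paper may define the boundary differently.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    successes = sum(1 for s in scores if s >= theta)
    return successes / len(scores)


# Placeholder scores (not real data): 103 trials above the threshold
# and 127 below it, matching the reported 103/230 count.
synthetic_scores = [1.0] * 103 + [0.0] * 127

rate = success_rate(synthetic_scores)
print(f"{rate:.1%}")  # 44.8%
```

Raising `theta` can only keep the rate the same or lower it, which is the sense in which a stricter threshold maintains selection pressure during the evolutionary search.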
