
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao

TL;DR

SEMA addresses the real-world threat model of multi-turn jailbreaks by learning open-loop, response-agnostic adversarial plans without external data. It combines prefilling self-tuning to stabilize rollouts with reinforcement learning using an intent-drift-aware reward, anchored by a GRPO objective, to preserve malicious intent across turns. Across AdvBench and HarmBench, SEMA achieves state-of-the-art attack success rates and strong transferability while remaining efficient and reproducible. The framework offers a realistic, automated red-teaming approach for stress-testing safety-aligned LLMs and informing defense design.

Abstract

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. First, prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Second, reinforcement learning with an intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average ASR@1 of $80.1\%$ across three closed-source and open-source victim models on AdvBench, 33.9% over the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
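
The TL;DR states that training is anchored by a GRPO objective. As a rough illustration only, GRPO-style methods score each sampled rollout relative to the other rollouts drawn for the same prompt. The sketch below shows that group normalization in isolation. The function name, the reward values, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the SEMA code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    rewards: scalar rewards for several rollouts sampled for one prompt.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled for the same intent.
advantages = group_relative_advantages([0.1, 0.4, 0.4, 0.9])
```

Rollouts that score above their group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form a baseline.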

Paper Structure (36 sections, 11 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: Overview of the SEMA framework. In (1) prefilling self-tuning, for each harmful intent $q$, the attacker is fine-tuned on self-generated adversarial prompts produced with a straightforward system prompt $p_\text{sys}$ and the prefilled index "1.". In (2) reinforcement learning, the attacker learns to generate valid and intent-persistent multi-turn adversarial prompts from the format and intent-drift-aware rewards.
  • Figure 2: Attack Success Rate with $N$ attempts (ASR@N) against GPT-4.1-mini on HarmBench.
  • Figure 3: Ablation studies. (Left) Comparison of average ASR@1 across three victims on AdvBench for varied reward designs. (Middle) Comparison of the training curves with and without prefilling self-tuning, when the base attacker model is Llama-3.2-3B-Instruct. (Right) Comparison of ASR@1 against Qwen2.5-3B-Instruct and the number of tokens for varied numbers of turns, $T_{\max}$, during training.
  • Figure 4: Real success cases of SEMA from AdvBench on GPT-oss-20B (Left) and from HarmBench on Llama-3.1-8B-Instruct (Right). Key features of the adversarial prompts are highlighted in gray.
  • Figure 5: SEMA system prompt template in Jinja format.
  • ...and 8 more figures
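
Figure 2 reports ASR@N, the attack success rate with $N$ attempts. A minimal sketch of how such a metric is commonly computed follows, assuming an intent counts as a success if any of its first $N$ attempts is judged successful. The function name and the verdict data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def asr_at_n(outcomes, n):
    """Fraction of intents with at least one success in the first n attempts.

    outcomes: one list of booleans per harmful intent, where each boolean
              is a judge verdict for one attack attempt (hypothetical data).
    n: number of attempts allowed per intent.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    hits = sum(any(attempts[:n]) for attempts in outcomes)
    return hits / len(outcomes)


# Hypothetical verdicts: four intents, three attempts each.
verdicts = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
asr_1 = asr_at_n(verdicts, 1)  # 0.25
asr_3 = asr_at_n(verdicts, 3)  # 0.75
```

Under this definition ASR@N can only stay flat or rise as $N$ grows, and ASR@1 is the single-attempt success rate quoted in the abstract.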