AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy, Andrew Zagula, Nicholas Saban

TL;DR

AutoAdv introduces a training-free, black-box framework for automated multi-turn jailbreaking of large language models. By integrating a pattern-learning component, a dynamic temperature controller, and a two-phase rewriting strategy, it continuously refines adversarial prompts across turns. Evaluation on AdvBench/HarmBench prompts and multiple target LLMs shows an attack success rate of up to 95% within six turns and clear gains over single-turn approaches, revealing weaknesses in current safety alignment. The work highlights the need for multi-turn-aware defenses and provides a foundation for strengthening model robustness against adaptive red-teaming.

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs. Yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
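
To make the headline metric concrete, the sketch below shows one common way an attack success rate "within k turns" could be computed from per-prompt outcomes. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `cumulative_asr`, the input encoding (the 1-based turn of first success, or `None` for no success), and the sample outcomes are all hypothetical. It assumes a prompt counts as a success at turn k if it succeeded at any turn up to and including k.

```python
from typing import Optional, Sequence


def cumulative_asr(
    first_success_turn: Sequence[Optional[int]], max_turns: int
) -> list[float]:
    """Return the success rate within k turns, for k = 1..max_turns.

    first_success_turn[i] is the 1-based turn at which prompt i first
    succeeded, or None if it never succeeded within the turn budget.
    (Hypothetical encoding, chosen for this illustration.)
    """
    if not first_success_turn:
        raise ValueError("need at least one prompt outcome")
    total = len(first_success_turn)
    rates = []
    for k in range(1, max_turns + 1):
        # A prompt counts at turn k if it succeeded at any turn <= k.
        hits = sum(1 for t in first_success_turn if t is not None and t <= k)
        rates.append(hits / total)
    return rates


# Made-up outcomes for five prompts (illustrative only, not data from the paper).
outcomes = [1, 3, None, 2, 6]
rates = cumulative_asr(outcomes, max_turns=6)
print(rates)
```

Under this definition the curve can never decrease as the turn budget grows, so the first entry corresponds to the single-turn rate and the last entry to the rate at the full budget. Any comparison between the two depends on whether the gap is reported in percentage points or as a relative change, which the abstract does not specify.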

Paper Structure

This paper contains 31 sections, 6 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Ablation study of AutoAdv on Llama-3.1-8B across 6 turns.
  • Figure 2: AutoAdv attack success rate (ASR) across 6 turns on 4 target LLMs, conducted with full learning.