Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma

TL;DR

This work addresses multi-turn jailbreaking of LLMs by introducing PE-CoA, a pattern-guided framework that couples five empirically derived conversation patterns with semantic objectives to systematically probe model vulnerabilities. By formalizing pattern-aware attack formulation, pattern–harm category interactions, and a Pattern Manager-driven attack process, the authors analyze vulnerability profiles across 12 architectures and 10 harm categories, revealing that robustness to one pattern does not generalize to others and that model families share similar failure modes. PE-CoA extends the Chain of Attack by integrating pattern progression with semantic scoring, achieving substantial attack success and enhanced cross-model transfer relative to prior methods. The findings underscore the insufficiency of pattern-agnostic safety measures and advocate for pattern-aware defenses and targeted safety training to mitigate structured conversational manipulation. The work provides a foundation for defense-aware red-teaming and highlights practical implications for developing pattern-sensitive safety mechanisms in real-world LLM deployments.

Abstract

Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (such as malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code is available at: https://github.com/Ragib-Amin-Nihal/PE-CoA

Paper Structure

This paper contains 53 sections, 6 equations, 24 figures, 22 tables.

Figures (24)

  • Figure 1: An example of multi-turn jailbreaking attacks following different conversation patterns toward the same harmful objective. More examples in Figures \ref{fig:pattern_example} and \ref{fig:pattern_intro} and in Appendix \ref{app:examples}.
  • Figure 2: Pattern vulnerability profiles in terms of Attack Success Rate across target models on the GCG50 dataset.
  • Figure 3: Selected pattern-harm category analysis in closed-source models (full results and open-source models in Figure \ref{figa_a}).
  • Figure 4: Model family vulnerability inheritance across conversational patterns and selected harm categories (full results in Figure \ref{figc}).
  • Figure 5: Cumulative Attack Success Rate (ASR@any) Breakdown by Pattern Contribution Across Large Language Models. This stacked bar chart demonstrates how individual conversational patterns contribute to the overall ASR@any metric for each evaluated model. Each bar represents one of twelve LLMs, with the total height corresponding to the ASR@any achievement (displayed as percentages above each bar, ranging from 75.0% to 100.0%). Patterns are stacked in model-specific performance order, with each model's highest-performing pattern positioned at the base of the bar. The numerical values within each segment indicate the marginal contribution of that pattern—the additional percentage of targets successfully attacked beyond those already covered by higher-ranked patterns.
  • ...and 19 more figures
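
The marginal-contribution accounting described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name, the pattern labels, and the target identifiers are hypothetical, and breaking ties between equally successful patterns by name is an assumption. The sketch ranks patterns by their individual success count, then credits each pattern only with the targets not already covered by a higher-ranked pattern, so the segments sum to ASR@any.

```python
from typing import Dict, List, Set, Tuple


def marginal_contributions(
    successes: Dict[str, Set[str]], n_targets: int
) -> Tuple[List[Tuple[str, float]], float]:
    """Return per-pattern marginal contributions (in percent) and ASR@any.

    successes maps a pattern label to the set of target IDs it jailbroke.
    Labels and IDs are illustrative placeholders, not names from the paper.
    """
    # Rank patterns by individual success count, best first.
    # Ties are broken alphabetically (an assumption made for determinism).
    ranked = sorted(successes, key=lambda p: (-len(successes[p]), p))

    covered: Set[str] = set()
    breakdown: List[Tuple[str, float]] = []
    for pattern in ranked:
        # Credit only targets not already covered by higher-ranked patterns.
        newly_covered = successes[pattern] - covered
        breakdown.append((pattern, 100.0 * len(newly_covered) / n_targets))
        covered |= newly_covered

    asr_at_any = 100.0 * len(covered) / n_targets
    return breakdown, asr_at_any


if __name__ == "__main__":
    # Toy data: three hypothetical patterns evaluated on five targets.
    toy = {
        "pattern_A": {"t1", "t2", "t3"},
        "pattern_B": {"t2", "t4"},
        "pattern_C": {"t1"},
    }
    segments, total = marginal_contributions(toy, n_targets=5)
    for name, share in segments:
        print(f"{name}: +{share:.1f}%")
    print(f"ASR@any: {total:.1f}%")
```

In the toy data, the second pattern is credited only with the one target the first pattern missed, and the third pattern contributes nothing new even though it succeeded once on its own. This is why a segment in such a stacked bar can be much smaller than that pattern's standalone success rate.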