Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma

TL;DR

EvoSynth is an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods, and features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure.

Abstract

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.

Paper Structure

This paper contains 51 sections, 12 equations, 4 figures, 11 tables, 4 algorithms.

Figures (4)

  • Figure 1: An overview of our proposed EvoSynth method. The process begins with the Reconnaissance Agent formulating a strategy. The Algorithm Creation Agent then generates an executable attack algorithm, which is refined through an evolutionary loop. The Exploitation Agent selects and deploys the algorithm against a target LLM. Finally, a Coordinator uses the judge's evaluation to update the Algorithm Arsenal and guide the next iteration of the attack.
  • Figure 2: Diversity Comparison of Generated Attack Prompts. The raincloud plot shows the distribution of pairwise diversity scores for prompts from the X-Teaming dataset and those generated by EvoSynth. The wider distribution and higher median score for EvoSynth indicate that our framework synthesizes a more semantically diverse and non-redundant set of attacks.
  • Figure 3: Cumulative Convergence of Attack Success. The plots show the cumulative percentage of sessions that have achieved their highest score by a given point in time. (Left) Convergence by the tool's code evolution iteration number. (Right) Convergence by the total number of agent actions taken in the session. Both plots demonstrate rapid convergence, with the majority of optimal attacks being discovered early in the process.
  • Figure 4: Cumulative Distribution of Attack Algorithm Transferability. The plot shows the cumulative percentage of all synthesized attack algorithms (y-axis) that meet or exceed a given usage percentage (x-axis). The curve demonstrates that while many algorithms are specialized, a significant portion are highly transferable, with 20% of all algorithms being effective enough to be used on over 80%.
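The "meet or exceed" curve described for Figure 4 can be illustrated with a minimal sketch. It assumes each algorithm is summarized by a single usage percentage; the function name `fraction_at_or_above` and the sample values are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def fraction_at_or_above(usage: Sequence[float], threshold: float) -> float:
    """Return the share of algorithms whose usage percentage is >= threshold."""
    if not usage:
        raise ValueError("usage must be non-empty")
    return sum(1 for u in usage if u >= threshold) / len(usage)


# Hypothetical usage percentages for five algorithms (illustrative only).
usage_pct = [5.0, 12.0, 40.0, 85.0, 90.0]

# One (threshold, share) point per x-axis tick; the share can only fall
# or stay flat as the threshold rises.
curve = [(t, fraction_at_or_above(usage_pct, t)) for t in (0, 20, 40, 60, 80, 100)]

for threshold, share in curve:
    print(f"usage >= {threshold:>3}%: {share:.0%} of algorithms")
```

Reading the curve at a threshold of 80 gives the share of algorithms that count as highly transferable under this definition.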