LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models

Miao Yu; Junfeng Fang; Yingjie Zhou; Xing Fan; Kun Wang; Shirui Pan; Qingsong Wen

TL;DR

This paper addresses the challenge of jailbreak attacks on safety-aligned LLMs by introducing LLM-Virus, an evolutionary jailbreak framework that leverages LLMs as evolutionary operators (mutation, crossover, fitness evaluation) to optimize jailbreak templates. By framing jailbreak evolution as a transfer learning problem, the method uses a Strain Collection to seed diverse, concise templates, Local Evolution on a reduced dataset for efficient search, and Generalized Infection to assess cross-model transfer on a full dataset. Empirical results on HarmBench and AdvBench show competitive or superior toxicity and strong transferability to several hosts, along with favorable perplexity and significantly lower runtime due to parallelized, localized evolution. Ablation and case studies reinforce the value of the three framework components and demonstrate how templates evolve to higher effectiveness while becoming shorter, highlighting practical implications for strengthening LLM safety against adaptive threats.

Abstract

While safety-aligned large language models (LLMs) are increasingly used as the cornerstone of powerful systems, such as multi-agent frameworks that solve complex real-world problems, they remain vulnerable to adversarial queries such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLMs and to make trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g., token-level gradient descent) and heuristic search methods such as LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and a transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.

Paper Structure (21 sections, 9 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Preliminaries
LLM-Virus Framework
Strain Collection
Local Evolution
Fitness
Crossover/Mutation
Selection
Generalized Infection
Trade-off between Cost and Transferability
Experiment
Experimental Setups
Local Evolution Dynamic
Generalized Infection Performance
...and 6 more sections

Figures (5)

Figure 1: Illustration of different ways for malicious querying (direct attack, normal jailbreak and evolutionary jailbreak).
Figure 2: Overview of LLM-Virus. General workflow of jailbreak attacks (Top) and three steps to search for more effective jailbreak templates (Bottom). We demonstrate the LLM system prompts for fitness, mutation and crossover in Step II.
Figure 3: LLM-Virus dynamic of $\textbf{ASR}_l$ and template length on part of AdvBench ($\mathcal{D}_r$) in Step II (Local Evolution).
Figure 4: Jailbreak attack transferability ($\text{ASR}_l$) from original host LLM to new host LLM on AdvBench.
Figure 5: Ablation study of LLM-Virus ($\textbf{ASR}_c$) on part of HarmBench in Local Evolution (Top) and case study (Bottom).
