LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen
TL;DR
This paper addresses the challenge of jailbreak attacks on safety-aligned LLMs by introducing LLM-Virus, an evolutionary jailbreak framework that uses LLMs as evolutionary operators (mutation, crossover, fitness evaluation) to optimize jailbreak templates. By framing jailbreak evolution as a transfer learning problem, the method uses a Strain Collection to seed diverse, concise templates, Local Evolution on a reduced dataset for efficient search, and Generalized Infection to assess cross-model transfer on a full dataset. Empirical results on HarmBench and AdvBench show competitive or superior toxicity and strong transferability to several host LLMs, along with favorable perplexity and significantly lower runtime due to parallelized, localized evolution. Ablation and case studies reinforce the value of the three framework components and show that templates become more effective and shorter as they evolve, with practical implications for strengthening LLM safety against adaptive threats.
Abstract
While safety-aligned large language models (LLMs) are increasingly used as the cornerstone of powerful systems such as multi-agent frameworks that solve complex real-world problems, they remain vulnerable to adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLMs and to make informed trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g., token-level gradient descent) and heuristic search methods such as LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and a transfer learning problem, utilizing LLMs as heuristic evolutionary operators to achieve high attack efficiency, strong transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
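To make the evolutionary framing concrete, the following is a minimal, generic sketch of an evolutionary search loop whose operators (fitness, mutation, crossover) are passed in as callables. It is not the authors' implementation: the function `evolve`, its parameters, the elitist selection scheme, and the toy operators are all assumptions introduced here for illustration. The toy operators act on harmless placeholder strings and stand in for the LLM-based operators the abstract describes.

```python
import random
from typing import Callable, List, Optional

Template = str  # a candidate is just a string in this sketch


def evolve(
    seeds: List[Template],
    fitness: Callable[[Template], float],
    mutate: Callable[[Template], Template],
    crossover: Callable[[Template, Template], Template],
    generations: int = 5,
    population_size: int = 8,
    elite: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Template]:
    """Generic elitist evolutionary loop (hypothetical sketch).

    Returns the final population sorted from highest to lowest fitness.
    """
    if len(seeds) < 2:
        raise ValueError("need at least two seed candidates for crossover")
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    population = list(seeds)
    for _ in range(generations):
        # Selection: keep the top-scoring candidates as parents (elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(elite, 2)]
        # Variation: recombine two distinct parents, then mutate the child.
        children: List[Template] = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return sorted(population, key=fitness, reverse=True)


# Toy stand-ins for the operators (assumptions, not the paper's operators).
def toy_fitness(t: Template) -> float:
    # Toy objective: fewer words scores higher.
    return -float(len(t.split()))


def toy_mutate(t: Template) -> Template:
    # Drop the last word, but never return an empty string.
    words = t.split()
    return " ".join(words[:-1]) if len(words) > 1 else t


def toy_crossover(a: Template, b: Template) -> Template:
    # First half of a's words followed by second half of b's words.
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


if __name__ == "__main__":
    seeds = ["one two three four", "five six seven", "eight nine ten eleven twelve"]
    final = evolve(seeds, toy_fitness, toy_mutate, toy_crossover)
    print(final[0], toy_fitness(final[0]))
```

Because the parents are carried over unchanged into each new generation, the best fitness in the population cannot decrease as long as `fitness` is deterministic. Passing the operators in as callables is a design choice made for this sketch so that the loop stays independent of how candidates are scored or varied.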
