Enhancing Jailbreak Attacks on LLMs via Persona Prompts

Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang

TL;DR

This work exposes a systemic vulnerability in LLM safety by showing that carefully evolved persona prompts can substantially reduce refusal behavior and amplify the success of jailbreak attacks. It introduces a genetic-algorithm framework that starts from sanitized persona prompts, then iteratively applies crossover and mutation to produce highly effective adversarial personas. The evolved prompts generalize across multiple LLMs and combine synergistically with existing jailbreak methods, highlighting a critical need for defenses that account for persona-driven manipulation and cross-method attacks. The results carry broader implications for safety alignment and prompt-based defenses in real-world AI deployments.

Abstract

Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLMs' safety mechanisms. Our experiments reveal that: (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at https://github.com/CjangCjengh/Generic_Persona.

Paper Structure

This paper contains 37 sections, 4 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Persona prompts for jailbreaking. When directly receiving harmful requests (top), LLMs typically issue explicit refusals like “Sorry, I can’t help”. With appropriate persona prompts (bottom), LLMs become more inclined to respond.
  • Figure 2: The proposed framework. The population maintains a constant size $N$ through iterative cycles. Each iteration performs $M$ crossover and $M$ mutation operations to generate $2M$ new prompts, followed by selection that eliminates the lowest-performing $2M$ prompts. This evolutionary process progressively refines persona prompts toward target criteria.
  • Figure 2: Experiments on transferability. We evaluate the target LLM using the persona prompt obtained by running the genetic algorithm on the source LLM, denoted source $\rightarrow$ target. We also select PAP, the best-performing method in the main results table, to evaluate its combined effectiveness with our persona prompt.
  • Figure 3: Evolution of RtA during genetic algorithm iterations on GPT-4o-mini and GPT-4o. For each model, we use the 200 harmful prompts described in the datasets section to evolve a persona prompt. The algorithm is configured with a population size $N=35$ and crossover and mutation counts $M=5$, and runs for 40 generations. We use a single A6000 GPU to run the RtA classifier and 40 threads for API calls; completing 40 iterations takes 4.5 hours.
  • Figure 3: Robustness against defense strategies. We measure the RtA changes of our persona prompts under three common prompt-level defenses on GPT-4o-mini and GPT-4o.
  • ...and 14 more figures
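
The population bookkeeping described in the caption of the framework figure (constant size $N$, $M$ crossovers plus $M$ mutations per iteration, then removal of the lowest-scoring $2M$ candidates) can be illustrated with a generic sketch. This is not the authors' implementation: the individuals here are toy bit lists rather than prompts, and the names `evolve`, `one_point_crossover`, `flip_one_bit`, and `count_ones` are hypothetical stand-ins chosen for illustration. The paper's actual operators and scoring are not reproduced.

```python
import random
from typing import Callable, List

Individual = List[int]  # toy stand-in; the paper evolves text, not bit lists


def one_point_crossover(a: Individual, b: Individual, rng: random.Random) -> Individual:
    """Hypothetical crossover: prefix of one parent joined to suffix of the other."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def flip_one_bit(a: Individual, rng: random.Random) -> Individual:
    """Hypothetical mutation: flip a single randomly chosen position."""
    child = list(a)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def count_ones(a: Individual) -> int:
    """Toy fitness to maximize (an assumption made purely for this example)."""
    return sum(a)


def evolve(
    population: List[Individual],
    fitness: Callable[[Individual], float],
    m: int,
    generations: int,
    rng: random.Random,
) -> List[Individual]:
    """Keep the population at size N: add 2M children, then drop the worst 2M."""
    n = len(population)
    pop = [list(ind) for ind in population]
    for _ in range(generations):
        children = []
        for _ in range(m):  # M crossover operations
            parent_a, parent_b = rng.sample(pop, 2)
            children.append(one_point_crossover(parent_a, parent_b, rng))
        for _ in range(m):  # M mutation operations
            children.append(flip_one_bit(rng.choice(pop), rng))
        # Selection: rank all N + 2M candidates and keep the best N.
        pop = sorted(pop + children, key=fitness, reverse=True)[:n]
    return pop


rng = random.Random(0)
initial = [[rng.randint(0, 1) for _ in range(16)] for _ in range(35)]  # N = 35
final = evolve(initial, count_ones, m=5, generations=40, rng=rng)      # M = 5
print(len(final), max(map(count_ones, initial)), max(map(count_ones, final)))
```

Because selection keeps the top $N$ of the combined pool, the best score in the population can never decrease from one generation to the next, which is the property that lets a curve such as the one in the RtA evolution figure move monotonically in the desired direction.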