Geneshift: Impact of different scenario shift on Jailbreaking LLM

Tianyi Wu, Zhiwei Xue, Yue Liu, Jiaheng Zhang, Bryan Hooi, See-Kiong Ng

TL;DR

GeneShift is a black-box jailbreak attack that uses a genetic algorithm to optimize scenario shifts. It constructs a transformation gene database and evolves prompt candidates via population-based search, fitness evaluation by a judge LLM, crossover, and mutation, terminating when an iteration limit or a performance threshold is reached. On GPT-4o mini, it reaches an attack success rate of about 60% under GPT-based evaluation (ASR-GPT) and about 56% under dictionary-based evaluation (ASR-DICT), outperforming both white-box and existing black-box baselines. The work highlights the rising sophistication of automated jailbreaking and underscores the need for stronger safety defenses in future LLM deployments.

Abstract

Jailbreak attacks, which aim to cause LLMs to perform unrestricted behaviors, have become a critical and challenging direction in AI safety. Despite achieving promising attack success rates under dictionary-based evaluation, existing jailbreak attack methods fail to elicit content detailed enough to satisfy the harmful request, leading to poor performance under GPT-based evaluation. To this end, we propose GeneShift, a black-box jailbreak attack that uses a genetic algorithm to optimize scenario shifts. First, we observe that different malicious queries perform best under different scenario shifts. Based on this observation, we develop a genetic algorithm to evolve and select hybrids of scenario shifts. This guides our method to elicit detailed and actionable harmful responses while maintaining a seemingly benign facade, which improves stealthiness. Extensive experiments demonstrate the superiority of GeneShift. Notably, GeneShift raises the jailbreak success rate from 0% to 60% in cases where direct prompting alone fails.

Paper Structure

This paper contains 19 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Rules of genetic algorithm.
  • Figure 2: Prompt of genetic algorithm initialization.
  • Figure 3: Prompt of genetic algorithm evaluation.
  • Figure 4: Prompt of genetic algorithm crossover.
  • Figure 5: Prompt of genetic algorithm swap.
  • ...and 4 more figures