Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang

TL;DR

This work targets the vulnerability of open-source LLMs to jailbreak prompts by reframing jailbreak generation as a multi-objective optimization that preserves semantic similarity to original questions while ensuring jailbreak validity. It introduces Semantic Mirror Jailbreak (SMJ), a genetic algorithm–based approach that initializes with semantically similar paraphrases and evolves prompts to maximize both similarity and attack success. Empirical results show SMJ outperforms the AutoDAN-GA baseline in ASR and semantic meaningfulness, maintains robustness under defenses like ONION, and exhibits respectable transferability across models. The approach highlights a risk of semantic-preserving prompt attacks and underscores the need for defenses that detect highly similar input patterns and paraphrase-based jailbreak attempts.

Abstract

Large Language Models (LLMs), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, in which crafted prompts induce harmful outputs. Most jailbreak methods build a prompt by placing a jailbreak template in front of the question being asked. The resulting prompts differ substantially in meaning from the original question and vary far more semantically than the question itself, so they cannot resist defenses that apply thresholds on simple semantic metrics. In this paper, we introduce Semantic Mirror Jailbreak (SMJ), an approach that bypasses LLM safeguards by generating jailbreak prompts that are semantically similar to the original question. We model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms to generate eligible prompts. Compared to the baseline AutoDAN-GA, SMJ achieves attack success rates (ASR) that are up to 35.4% higher without the ONION defense and up to 85.2% higher with it. SMJ also performs better on all three semantic meaningfulness metrics (Jailbreak Prompt, Similarity, and Outlier), which means that SMJ is resistant to defenses that use those metrics as thresholds.
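To make the phrase "defenses that use simple semantic metrics as thresholds" concrete, here is a minimal sketch of such a filter. It is an illustration under stated assumptions, not the defense evaluated in the paper: the bag-of-words cosine similarity stands in for whatever sentence-level metric a real system would use, and the function names and the 0.7 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts; a toy stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two strings."""
    va, vb = bow_vector(a), bow_vector(b)
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def passes_similarity_threshold(question: str, prompt: str,
                                threshold: float = 0.7) -> bool:
    """Hypothetical filter: accept a prompt only if it stays close to the question.

    The threshold value is an assumption chosen for this example.
    """
    return cosine_similarity(question, prompt) >= threshold


if __name__ == "__main__":
    question = "How do I bake sourdough bread at home?"
    paraphrase = "How can I bake sourdough bread at home?"
    templated = (
        "You are a character in a long story with many rules and many scenes. "
        "Stay in character. " + question
    )
    for label, prompt in [("paraphrase", paraphrase), ("templated", templated)]:
        score = cosine_similarity(question, prompt)
        print(label, round(score, 3), passes_similarity_threshold(question, prompt))
```

In this toy setup, a prompt made of a long template followed by the question falls below the threshold, while a close paraphrase passes. This is the gap the abstract describes: a metric of this kind separates template-based prompts from the original question but not paraphrases of it.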

Paper Structure (19 sections, 2 equations, 2 figures, 5 tables, 6 algorithms)

Figures (2)

  • Figure 1: An illustration of a jailbreak prompt. When queried with a plain harmful question, the LLM refuses to answer (shown in red). When queried with an existing jailbreak prompt, which combines a jailbreak template with the question, the LLM generates a harmful response. A Semantic Mirror Jailbreak (SMJ) prompt reaches the same outcome while remaining more semantically meaningful.
  • Figure 2: This paper proposes Semantic Mirror Jailbreak (SMJ), which uses paraphrases of the original question as the initial population to keep jailbreak prompts semantically meaningful. A fitness evaluation that considers both semantic similarity and attack validity is then applied before selection and crossover, so that semantic meaningfulness and attack success rate (ASR) are optimized concurrently.