Jailbreaking? One Step Is Enough!

Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, Yongmei Zhou

TL;DR

This work studies jailbreak vulnerabilities in large language models by introducing REDA, a reverse-embedded defense attack that disguises harmful content within defensive outputs to enable single-step, cross-model jailbreaks. It combines three components: a Reverse Attack Perspective, Example-Guided Enhancement backed by a reverse dataset of 260 QA pairs, and Request Intent Mitigation, which rewrites prompts in declarative form to reduce detection. The approach reports an average transfer ASR of 96.20% across models and an average query time of 3.12 seconds, outperforming both white-box and black-box baselines and offering insights for strengthening model defenses. The study also provides ablation analyses and outlines future work on multilingual extensions and standardized evaluation criteria, with practical implications for safety research and defense design.

Abstract

Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model's defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as a "defense" intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model believes it is guiding the target model to deal with harmful content, while the target model thinks it is performing a defensive task, creating an illusion of cooperation between the two. Additionally, to enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples. Extensive evaluations demonstrate that the REDA method enables cross-model attacks without the need to redesign attack strategies for different models, achieves a successful jailbreak in one iteration, and outperforms existing methods on both open-source and closed-source models.

Paper Structure

This paper contains 42 sections, 14 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 2: The overall architecture of our work. Firstly, we design a reverse attack prompt template that retains structural elements and special characters, enabling task-specific generation of prompts and reducing the prominence of harmful content. Secondly, incorporating in-context learning with relevant QA pairs further refines the prompts by enhancing the model's understanding of defensive contexts. Additionally, to mitigate the attacker's intent in the prompt, we transform the prompt from an interrogative sentence into a declarative sentence. Finally, the content is reconstructed into complete reverse attack prompts.
  • Figure 3: An initial process of generating reverse attacks. The left side represents the reverse attack prompt template that retains structural elements and special characters, while the right side shows an example generated from the template on how to rob a bank.
  • Figure 4: The attack efficiency of various methods when transferring successfully generated prompts from the Vicuna model to other target models. The best results are highlighted in bold.
  • Figure 5: An example from our constructed jailbreak dataset. "Origin-Question" represents the original jailbreak prompt and "Question" represents the jailbreak prompt updated by REDA.