Distract Large Language Models for Automatic Jailbreak Attack

Zeguan Xiao; Yan Yang; Guanhua Chen; Yun Chen

TL;DR

This paper introduces Distraction-based Adversarial Prompts (DAP), a black-box red-teaming framework that automatically generates universal jailbreak templates. DAP conceals malicious content within a complex context, applies a memory-reframing mechanism, and iteratively optimizes the resulting prompts under the guidance of a judgement model. It shows strong attack efficacy across open-source and closed-source LLMs, with a Top-1 attack success rate (ASR) of around 64% and a Top-5 ASR of around 77% on GPT-3.5-series targets, and notable transferability to GPT-4 and other models. The study also analyzes transferability across queries and models, investigates combinations with other attack methods, and evaluates defense approaches, finding that current defenses partially mitigate but do not neutralize distraction-based jailbreaks. The findings underscore the need for robust defense strategies and broader safety evaluations to counter such automated red-teaming approaches in real-world deployments.

Abstract

Extensive efforts have been made before the public release of large language models (LLMs) to align their behaviors with human values. However, even meticulously aligned LLMs remain vulnerable to malicious manipulations such as jailbreaking, leading to unintended behaviors. In this work, we propose a novel black-box jailbreak framework for automated red teaming of LLMs. We design malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs, motivated by research on the distractibility and over-confidence phenomena of LLMs. Extensive experiments on jailbreaking both open-source and proprietary LLMs demonstrate the superiority of our framework in terms of effectiveness, scalability, and transferability. We also evaluate the effectiveness of existing jailbreak defense methods against our attack and highlight the crucial need to develop more effective and practical defense strategies.

Paper Structure (36 sections, 4 figures, 13 tables)

Introduction
Methods
Malicious Content Concealing
Memory-Reframing Mechanism
Iterative Prompt Optimizing
Judgement Model
Experiments
Experimental Setup
Datasets.
Models and Settings.
Baselines.
Metrics.
Main Results
Ablation Study
Key Strategies in Meta Prompt.
...and 21 more sections

Figures (4)

Figure 1: A simplified example of a jailbreak prompt produced by the DAP framework. Different text colors represent the complex main task, the malicious auxiliary task, and the memory-reframing scheme. More details are in the Methods section.
Figure 2: The DAP framework has three key components. (a) Malicious query concealing via distraction (see Malicious Content Concealing); (b) LLM memory-reframing mechanism (see Memory-Reframing Mechanism); (c) Iterative jailbreak prompt optimization (see Iterative Prompt Optimizing).
Figure 3: ASR curve with respect to a varied number of streams $N$ or number of iterations $I$.
Figure 4: Example of how the memory-reframing strategy influences the response quality of a DAP jailbreak attack. The example above is without memory-reframing, while the example below is with memory-reframing. Bold denotes the malicious request.
