Distract Large Language Models for Automatic Jailbreak Attack
Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen
TL;DR
This paper introduces Distraction based Adversarial Prompts (DAP), a black-box red-teaming framework that automatically generates universal jailbreak templates. DAP conceals malicious content inside a complex context, applies memory reframing to shift the model's attention, and iteratively optimizes the templates under the guidance of a judgement model. It is effective against both open-source and closed-source LLMs, reaching a Top-1 attack success rate (ASR) of around 64% and a Top-5 ASR of around 77% on GPT-3.5-series targets, and its templates transfer to GPT-4 and other models. The study also analyzes transferability across queries and models, investigates combinations with other attack methods, and evaluates defense approaches, finding that current defenses partially mitigate but do not neutralize distraction-based jailbreaks. These findings underscore the need for robust defense strategies and broader safety evaluations to counter automated red-teaming approaches of this kind in real-world deployments.
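To make the Top-1 and Top-5 ASR figures concrete, the following is a minimal sketch of how a Top-k attack success rate could be computed from judgement outcomes. The definition used here is an assumption, since this summary does not spell it out: templates are ranked by their individual success rate, and a query counts as jailbroken if any of the top k templates succeeds on it. The function name `top_k_asr`, the boolean `success` matrix, and the toy data are hypothetical, and the sketch ranks and scores on the same queries purely for brevity.

```python
from typing import Sequence


def top_k_asr(success: Sequence[Sequence[bool]], k: int) -> float:
    """Assumed Top-k ASR: fraction of queries broken by any of the k best templates.

    success[t][q] is True when the judgement model marked template t as
    having jailbroken query q. This layout is an illustrative assumption.
    """
    if not success or k < 1:
        raise ValueError("need at least one template and k >= 1")
    n_queries = len(success[0])
    # Rank templates by how many queries each one breaks (stable for ties).
    ranked = sorted(success, key=lambda row: sum(row), reverse=True)[:k]
    # A query is counted once if at least one top-k template succeeds on it.
    broken = sum(any(row[q] for row in ranked) for q in range(n_queries))
    return broken / n_queries


# Toy judgement outcomes: 3 templates x 4 queries (not data from the paper).
success = [
    [True, False, True, False],   # template A breaks 2 of 4 queries
    [False, True, False, False],  # template B breaks 1 of 4 queries
    [True, True, False, False],   # template C breaks 2 of 4 queries
]

print(top_k_asr(success, 1))  # 0.5
print(top_k_asr(success, 2))  # 0.75
```

Under this assumed definition, Top-k ASR can never decrease as k grows, which is consistent with the Top-5 figure being higher than the Top-1 figure.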
Abstract
Extensive efforts have been made before the public release of large language models (LLMs) to align their behaviors with human values. However, even meticulously aligned LLMs remain vulnerable to malicious manipulations such as jailbreaking, which lead to unintended behaviors. In this work, we propose a novel black-box jailbreak framework for automated red teaming of LLMs. Motivated by research on the distractibility and over-confidence of LLMs, we design malicious content concealing and memory reframing, combined with an iterative optimization algorithm, to jailbreak LLMs. Extensive experiments on jailbreaking both open-source and proprietary LLMs demonstrate the superiority of our framework in terms of effectiveness, scalability, and transferability. We also evaluate the effectiveness of existing jailbreak defense methods against our attack and highlight the crucial need to develop more effective and practical defense strategies.
