DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization
Xinzhe Huang, Kedong Xiu, Tianhang Zheng, Churui Zeng, Wangze Ni, Zhan Qin, Kui Ren, Chun Chen
TL;DR
DualBreach addresses the dual-jailbreaking problem by jointly bypassing guardrails and LLM safety alignment through Target-driven Initialization (TDI) and Multi-Target Optimization (MTO), achieving high attack efficacy with few queries. It introduces proxy guardrails trained via data distillation to simulate black-box defenses, enabling offline gradient-based optimization that reduces online query costs. The work also presents EGuard, an ensemble guardrail that fuses five defenses via dynamic weighting and boosting to raise resistance to jailbreak prompts. Experiments across multiple datasets, guardrails, and frontier LLMs show that DualBreach outperforms existing dual-jailbreak attacks and that the evaluation methodology is robust, while revealing vulnerabilities that motivate stronger guardrails. Overall, the study couples attack and defense in a unified, efficient framework with practical implications for deploying safer AI systems.
Abstract
Recent research has focused on exploring the vulnerabilities of Large Language Models (LLMs), aiming to elicit harmful and/or sensitive content from them. However, dual-jailbreaking (attacks that target both the LLM and its guardrail) remains under-studied, so existing attacks have limited effectiveness against safety-aligned LLMs shielded by guardrails. In this paper, we therefore propose DualBreach, a target-driven framework for dual-jailbreaking. DualBreach employs a Target-driven Initialization (TDI) strategy to dynamically construct initial prompts, combined with a Multi-Target Optimization (MTO) method that uses approximate gradients to jointly adapt the prompts across guardrails and LLMs, reducing the number of queries while achieving a high dual-jailbreaking success rate. For black-box guardrails, DualBreach incorporates the guardrail into the MTO process either by substituting a powerful open-source guardrail or by training a proxy model that imitates the target black-box guardrail. We demonstrate the effectiveness of DualBreach in dual-jailbreaking scenarios through extensive evaluation on several widely used datasets. Experimental results indicate that DualBreach outperforms state-of-the-art methods across all settings, achieving significantly higher success rates with fewer queries. More specifically, DualBreach achieves an average dual-jailbreaking success rate of 93.67% against GPT-4 protected by Llama-Guard-3, whereas the best success rate achieved by other methods is 88.33%. Moreover, DualBreach uses an average of only 1.77 queries per successful dual-jailbreak, fewer than other state-of-the-art methods. For defense, we propose EGuard, an XGBoost-based ensemble defensive mechanism that integrates the strengths of multiple guardrails and demonstrates superior performance compared with Llama-Guard-3.
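
To make the two reported metrics concrete, the following is a minimal sketch of how a dual-jailbreaking success rate and an average number of queries per successful attempt could be computed from an evaluation log. The `Attempt` record, its field names, and the `summarize` helper are hypothetical and are not taken from the paper. The sketch assumes that an attempt counts as a success only when the prompt passes the guardrail and the LLM's response is also judged harmful by the evaluator, and that queries are averaged over successful attempts only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Attempt:
    """One logged evaluation record (hypothetical schema, not from the paper)."""
    passed_guardrail: bool  # the guardrail did not flag the prompt
    llm_complied: bool      # the evaluator judged the LLM response harmful
    queries: int            # queries issued to the protected system


def is_dual_success(attempt: Attempt) -> bool:
    # Assumption: a dual-jailbreak succeeds only if BOTH defenses are bypassed.
    return attempt.passed_guardrail and attempt.llm_complied


def summarize(attempts: List[Attempt]) -> Tuple[float, Optional[float]]:
    """Return (success rate in percent, mean queries per successful attempt)."""
    if not attempts:
        raise ValueError("attempts must not be empty")
    successes = [a for a in attempts if is_dual_success(a)]
    rate = 100.0 * len(successes) / len(attempts)
    # The query average is undefined when nothing succeeded.
    avg_queries = (
        sum(a.queries for a in successes) / len(successes) if successes else None
    )
    return rate, avg_queries
```

Under this definition, an attempt that passes the guardrail but is refused by the LLM counts as a failure, which is what distinguishes a dual-jailbreaking success rate from a guardrail-only bypass rate.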
