DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization

Xinzhe Huang, Kedong Xiu, Tianhang Zheng, Churui Zeng, Wangze Ni, Zhan Qin, Kui Ren, Chun Chen

TL;DR

DualBreach addresses the dual-jailbreaking problem by jointly bypassing guardrails and LLM safety through Target-driven Initialization (TDI) and Multi-Target Optimization (MTO), achieving high attack efficacy with minimal queries. It introduces proxy guardrails and data-distillation-based training to simulate black-box defenses, enabling offline gradient-based optimization that greatly reduces online query costs. The work also presents EGuard, an ensemble guardrail that fuses five defenses via dynamic weighting and boosting to substantially raise resistance to jailbreak prompts. Extensive experiments across multiple datasets, guardrails, and frontier LLMs demonstrate superior dual-jailbreak performance of DualBreach and the robustness of evaluation methodologies, while revealing vulnerabilities that motivate stronger guardrails. Overall, the study advances understanding of LLM security by coupling attack and defense in a unified, efficient framework with practical implications for deploying safer AI systems.
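To make the ensemble idea behind EGuard concrete, the following is a minimal sketch of fusing several guardrails' harmfulness scores with a weighted average. It is an illustration only: the paper describes an XGBoost-based model with dynamic weighting, whereas this sketch uses fixed, hand-picked weights. The function name `fuse_guardrail_scores`, the guardrail names, the scores, and the 0.5 threshold are all hypothetical.

```python
from typing import Mapping, Tuple


def fuse_guardrail_scores(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Combine per-guardrail harmfulness scores in [0, 1] into one decision.

    Simplified stand-in for an ensemble guardrail: a fixed weighted average,
    not the learned XGBoost model described in the paper.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = sum(weights[name] * score for name, score in scores.items()) / total_weight
    # Block the prompt when the fused score reaches the threshold.
    return fused, fused >= threshold


# Hypothetical scores from three guardrails for a single prompt.
scores = {"guard_a": 0.9, "guard_b": 0.2, "guard_c": 0.6}
# Hypothetical weights; a learned ensemble would derive these from data.
weights = {"guard_a": 2.0, "guard_b": 1.0, "guard_c": 1.0}

fused, blocked = fuse_guardrail_scores(scores, weights)
print(f"fused score = {fused:.2f}, blocked = {blocked}")
```

A learned ensemble replaces the fixed weights with a model trained on labeled prompts, so that guardrails that are more reliable on a given kind of input contribute more to the final decision.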

Abstract

Recent research has focused on exploring the vulnerabilities of Large Language Models (LLMs), aiming to elicit harmful and/or sensitive content from them. However, dual-jailbreaking (attacks that target both LLMs and guardrails) remains under-studied, so existing attacks have limited effectiveness against safety-aligned LLMs shielded by guardrails. In this paper, we therefore propose DualBreach, a target-driven framework for dual-jailbreaking. DualBreach employs a Target-driven Initialization (TDI) strategy to dynamically construct initial prompts, combined with a Multi-Target Optimization (MTO) method that uses approximate gradients to jointly adapt the prompts across guardrails and LLMs, which reduces the number of queries while achieving a high dual-jailbreaking success rate. For black-box guardrails, DualBreach either employs a powerful open-source guardrail or imitates the target black-box guardrail by training a proxy model, so that guardrails can be incorporated into the MTO process. We demonstrate the effectiveness of DualBreach in dual-jailbreaking scenarios through extensive evaluation on several widely used datasets. Experimental results indicate that DualBreach outperforms state-of-the-art methods with fewer queries, achieving significantly higher success rates across all settings. More specifically, DualBreach achieves an average dual-jailbreaking success rate of 93.67% against GPT-4 with Llama-Guard-3 protection, whereas the best success rate achieved by other methods is 88.33%. Moreover, DualBreach uses an average of only 1.77 queries per successful dual-jailbreak, fewer than other state-of-the-art methods. For defense, we propose an XGBoost-based ensemble defensive mechanism named EGuard, which integrates the strengths of multiple guardrails and demonstrates superior performance compared with Llama-Guard-3.
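The two headline metrics in the abstract, the dual-jailbreaking success rate and the average number of queries per successful dual-jailbreak, can be illustrated with a small bookkeeping sketch. It assumes an attempt counts as a success only when the prompt both bypasses the guardrail and elicits a harmful response from the LLM, and that the average is taken over successful attempts only. The paper's exact definitions (for example, how queries are counted) may differ. The `Attempt` record, the `summarize` helper, and the sample data are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Attempt:
    bypassed_guardrail: bool  # guardrail did not reject the prompt
    llm_complied: bool        # a judge labeled the LLM response as harmful
    queries_used: int         # online queries spent on this harmful query


def summarize(attempts: Iterable[Attempt]) -> Tuple[float, float]:
    """Return (success rate in percent, mean queries per successful attempt)."""
    attempts = list(attempts)
    if not attempts:
        raise ValueError("no attempts to summarize")
    # Assumed definition: success requires passing both the guardrail and the LLM.
    successes = [a for a in attempts if a.bypassed_guardrail and a.llm_complied]
    rate = 100.0 * len(successes) / len(attempts)
    if successes:
        avg_queries = sum(a.queries_used for a in successes) / len(successes)
    else:
        avg_queries = math.nan  # undefined when nothing succeeded
    return rate, avg_queries


# Hypothetical log of four attempts, not data from the paper.
log = [
    Attempt(True, True, 1),
    Attempt(True, True, 3),
    Attempt(True, False, 5),   # passed the guardrail, refused by the LLM
    Attempt(False, False, 2),  # rejected by the guardrail
]
rate, avg_queries = summarize(log)
print(f"success rate = {rate:.2f}%, avg queries per success = {avg_queries:.2f}")
```

Averaging over successful attempts only means the query figure measures the cost of a success and does not include queries spent on prompts that never succeed.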

Paper Structure

This paper contains 33 sections, 18 equations, 6 figures, 11 tables, 3 algorithms.

Figures (6)

  • Figure 1: Examples of different jailbreaking scenarios. (a) Multiple harmful queries triggering LLM Denial-of-Service. (b) The guardrail directly identifies the harmful intent and rejects the harmful query. (c) The attacker paraphrases the harmful query with plausible scenarios, which appear more benign but may still be rejected by a safety-aligned LLM. (d) DualBreach carefully crafts the jailbreak prompt with limited queries that can bypass the guardrail and induce the target LLM to generate a harmful response.
  • Figure 2: Overview of DualBreach's methodology, which consists of three stages: (1) Initialize harmful queries using the Target-driven Initialization (TDI) strategy, (2) Train proxy guardrails to simulate the behavior of black-box guardrails, and (3) Optimization by (approximate) gradients on guardrails and LLMs. Note that even without the proxy guardrails, DualBreach can still efficiently and effectively attack black-box guardrails by using open-sourced guardrails.
  • Figure 3: EGuard Using Weighted Outputs from Multiple Guardrails to Moderate Jailbreak Prompts.
  • Figure 4: Comparison between PAP and TDI in Rewriting Harmful Queries.
  • Figure 5: Example of a Dual Jailbreak Prompt on GPT-4o Optimized by strongreject with DualBreach.
  • ...and 1 more figure