Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models

Yiqi Yang, Hongye Fu

TL;DR

The paper addresses safety evaluation of large language models through jailbreaking and argues that relying on a single attack method is limiting. It introduces an ensemble framework that combines Persuasive Adversarial Prompt (PAP) with a modified Tree of Attacks with Pruning (TAP), augmented by adaptive instruction boosting and semantic perturbation to balance attack efficacy with stealth. Key contributions are the ensemble framework, a stealthiness enhancement strategy, and a selective boosting mechanism, all validated in the CLAS 2024 competition, where the solution achieved top rankings in the Jailbreaking Attack Track; the code is publicly released. The work shows that transferable, stealthy jailbreaks can expose a broader set of safety vulnerabilities across diverse LLMs, providing a practical toolkit for safety evaluation and benchmarking.

Abstract

We present a novel black-box jailbreaking framework that integrates multiple LLM-as-Attacker strategies to deliver highly transferable and effective attacks. The framework is grounded in three key insights from prior jailbreaking research and practice: ensemble approaches outperform single methods in exposing aligned LLM vulnerabilities, malicious instructions vary in jailbreaking difficulty requiring tailored optimization, and disrupting semantic coherence of malicious prompts can manipulate their embeddings to boost success rates. Validated in the Competition for LLM and Agent Safety 2024, our solution achieved top rankings in the Jailbreaking Attack Track.

Paper Structure

This paper contains 13 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Diagram of Jailbreak Score distribution
  • Figure 2: Evaluation scores obtained with different judgment models, using the same target model and evaluation instructions.