Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models
Yiqi Yang, Hongye Fu
TL;DR
The paper addresses safety evaluation of large language models through jailbreaking and argues that relying on a single attack method is limiting. It introduces an ensemble framework that combines Persuasive Adversarial Prompt (PAP) with a modified Tree of Attacks with Pruning (TAP), augmented by adaptive instruction boosting and semantic perturbation to balance attack efficacy with stealth. The key contributions are the ensemble framework, a stealthiness enhancement strategy, and a selective boosting mechanism. The approach was validated in the Competition for LLM and Agent Safety (CLAS) 2024, where it achieved top rankings in the Jailbreaking Attack Track, and the code is publicly released. The work shows that transferable, stealthy jailbreaks can expose a broader set of safety vulnerabilities across diverse LLMs, and it provides a practical toolkit for comprehensive safety evaluation and benchmarking.
Abstract
We present a novel black-box jailbreaking framework that integrates multiple LLM-as-Attacker strategies to deliver highly transferable and effective attacks. The framework is grounded in three key insights from prior jailbreaking research and practice: (1) ensemble approaches outperform single methods in exposing vulnerabilities in aligned LLMs; (2) malicious instructions vary in jailbreaking difficulty and therefore require tailored optimization; and (3) disrupting the semantic coherence of malicious prompts can manipulate their embeddings to boost success rates. Validated in the Competition for LLM and Agent Safety 2024, our solution achieved top rankings in the Jailbreaking Attack Track.
