Effective and Efficient Jailbreaks of Black-Box LLMs with Cross-Behavior Attacks
Vasudev Gohil
TL;DR
This work tackles the challenge of safely evaluating LLMs by proposing JCB, a black-box jailbreak framework that uses cross-behavior attacks to discover jailbreak prompts automatically and efficiently, without access to model internals. JCB has four components (seed prompt generation, selection, perturbation, and execution/evaluation) and relies on simple, fast perturbations and a feedback loop that reuses past successes across different harmful behaviors. On HarmBench and AdvBench, JCB outperforms state-of-the-art black-box baselines, requiring up to 94% fewer queries while achieving 12.9% higher average attack success. It reaches a 37% attack success rate on Llama-2-7B-Chat, one of the most resilient models, and shows zero-shot transferability across LLMs. The results highlight ongoing risks and the need for stronger defenses, and they offer a foundation for robust red-teaming and improved alignment in real-world settings.
Abstract
Despite recent advancements in Large Language Models (LLMs) and their alignment, they can still be jailbroken, i.e., harmful and toxic content can be elicited from them. While existing red-teaming methods have shown promise in uncovering such vulnerabilities, they suffer from limited success and high computational and monetary costs. To address this, we propose a black-box Jailbreak method with Cross-Behavior attacks (JCB) that can automatically and efficiently find successful jailbreak prompts. JCB leverages successes from past behaviors to help jailbreak new behaviors, thereby significantly improving attack efficiency. Moreover, JCB does not rely on time- or cost-intensive calls to auxiliary LLMs to discover or optimize the jailbreak prompts, making it highly efficient and scalable. Comprehensive experimental evaluations show that JCB significantly outperforms related baselines, requiring up to 94% fewer queries while still achieving 12.9% higher average attack success. JCB also achieves a notably high 37% attack success rate on Llama-2-7B, one of the most resilient LLMs, and shows promising zero-shot transferability across different LLMs.
