Jailbreaking with Universal Multi-Prompts
Yu-Ling Hsu, Hsuan Su, Shang-Tse Chen
TL;DR
This work addresses safety challenges in large language models by proposing JUMP, a universal jailbreak framework that extends BEAST and optimizes universal multi-prompts that transfer to unseen tasks. It also introduces DUMP, a defense method that learns universal defense prompts to mitigate such attacks, so that the same universal setting covers both attack and defense. Empirical results on AdvBench across diverse open- and closed-source models show that JUMP and its variants (including JUMP++) achieve strong attack performance, while DUMP significantly reduces attack success. The framework emphasizes transferability, prompt stealth (via perplexity constraints), and the trade-off between attack efficacy and naturalness, informing future directions for safer, more robust LLM deployments.
Abstract
Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, which results in higher computational costs when dealing with large datasets, while less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.
