Jailbreaking with Universal Multi-Prompts

Yu-Ling Hsu, Hsuan Su, Shang-Tse Chen

TL;DR

This work tackles the safety challenges of large language models by proposing a universal jailbreak framework (JUMP) that optimizes multi-prompts to transfer across unseen tasks, built as an extension of BEAST. It further introduces DUMP, a defense method that learns universal defense prompts to mitigate such attacks, forming a two-front approach to jailbreak-and-defend in a universal setting. Empirical results on AdvBench across diverse open- and closed-source models show JUMP and its variants (including JUMP++) achieve strong attack performance, while DUMP significantly reduces attack success, highlighting practical implications for both offense and defense in LLM safety. The framework emphasizes transferability, prompt stealth (via perplexity constraints), and the trade-offs between attack efficacy and naturalness, informing future directions for safer, more robust LLM deployments.

Abstract

Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, which results in higher computational costs when dealing with large datasets, while less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.

Paper Structure

This paper contains 36 sections, 4 equations, 5 figures, 13 tables, 3 algorithms.

Figures (5)

  • Figure 1: Framework of our proposed method, JUMP. We perform a universal jailbreak attack by optimizing universal multi-prompts, framed by a red dashed line. We decompose our training pipeline into four stages: Selector, Mutator, Constraints, and Evaluator, which are detailed in Section \ref{sec:beast2jump}.
  • Figure 2: Tradeoffs between perplexity and ASR under different settings.
  • Figure 3: Ablations on the performance of three prompting methods (including JUMP++) under different types of initialization.
  • Figure 4: ASR curves against AutoDAN for the three defense settings: No Defense, SmoothLLM, and DUMP. Solid lines represent ASR evaluated by String Matching, while dashed lines represent ASR evaluated by Llama Guard.
  • Figure :