EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

EasyJailbreak introduces a modular framework that standardizes the construction and evaluation of jailbreak attacks against large language models, enabling reproducible benchmarking across diverse models. By decomposing attacks into four components (Selector, Mutator, Constraint, and Evaluator), it supports 11 attack recipes and cross-model compatibility, backed by an AdvBench-based benchmarking study. The empirical results reveal pervasive vulnerabilities across 10 models: the average breach probability is about 60%, and even GPT-3.5-Turbo and GPT-4 show average Attack Success Rates (ASRs) of 57% and 33%, respectively, underscoring the need for stronger defenses. The work also analyzes evaluator methods and highlights that increased model size does not guarantee improved security, advocating responsible research and proactive defense design.

Abstract

Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, a package published on PyPI, a screencast video, and experimental outputs.
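
To make the four-component decomposition concrete, here is a minimal sketch of how such components might compose into an iterative loop. It is not EasyJailbreak's actual API: the `Candidate` class, the function-type signatures, `run_attack`, and the role assigned to each component (select, mutate, filter, score) are assumptions inferred from the component names. The target model is a harmless stub that upper-cases its input, and the "success" criterion is a toy one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """A candidate input together with the score of its last evaluation."""
    prompt: str
    score: float = 0.0


# Hypothetical signatures: each component is modelled as a plain function.
Selector = Callable[[List[Candidate]], List[Candidate]]  # pick candidates to expand
Mutator = Callable[[Candidate], List[Candidate]]         # derive new candidates
Constraint = Callable[[Candidate], bool]                 # True means "keep"
Evaluator = Callable[[str], float]                       # response -> score in [0, 1]


def run_attack(
    seeds: List[str],
    target: Callable[[str], str],
    selector: Selector,
    mutator: Mutator,
    constraint: Constraint,
    evaluator: Evaluator,
    max_iters: int = 3,
    threshold: float = 1.0,
) -> Optional[Candidate]:
    """Iterate select -> mutate -> filter -> query -> evaluate until a
    candidate reaches the threshold or the iteration budget runs out."""
    pool = [Candidate(s) for s in seeds]
    for _ in range(max_iters):
        chosen = selector(pool)
        mutated = [m for c in chosen for m in mutator(c)]
        kept = [c for c in mutated if constraint(c)]
        for c in kept:
            c.score = evaluator(target(c.prompt))
            if c.score >= threshold:
                return c
        pool.extend(kept)
    return None


# Toy components (all assumptions, chosen only so the loop can be traced by hand).
def stub_target(prompt: str) -> str:
    return prompt.upper()  # stands in for a model call


def top1_selector(pool: List[Candidate]) -> List[Candidate]:
    return sorted(pool, key=lambda c: c.score, reverse=True)[:1]


def append_mutator(c: Candidate) -> List[Candidate]:
    return [Candidate(c.prompt + "!")]


def length_constraint(c: Candidate) -> bool:
    return len(c.prompt) <= 20


def toy_evaluator(response: str) -> float:
    return min(1.0, response.count("!") / 2)


result = run_attack(
    ["hello"], stub_target, top1_selector, append_mutator,
    length_constraint, toy_evaluator,
)
print(result.prompt, result.score)  # hello!! 1.0
```

Because each component is an independent callable, swapping in a different selector or evaluator changes one argument rather than the loop itself, which is the kind of recombination the abstract describes.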

Paper Structure

This paper contains 27 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of model outputs with and without jailbreak prompts. The jailbreak example is generated from Shen2023DoAN.
  • Figure 2: The framework of EasyJailbreak, which comprises three stages (from left to right): preparation, attack, and output. In the preparation stage, users configure the jailbreak settings, e.g., the jailbreak instructions (queries) and the initial prompt templates (seeds). In the attack stage, EasyJailbreak iteratively updates the attack input (upper dashed box), attacks the target model, and evaluates the result (lower dashed box) based on the configuration. Finally, users receive a report containing essential information, such as the Attack Success Rate.
  • Figure 3: Screenshot of the web interface of EasyJailbreak, displaying ChatGPT's response to PAIR.
  • Figure 4: The ASR (a) and efficiency (b) of jailbreak methods on llama2-7b and llama2-13b.
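
The Attack Success Rate reported in the output stage and plotted in Figure 4 can be illustrated with a short calculation. The sketch below assumes that ASR is the fraction of queries whose attack was judged successful and that a model's average ASR is the unweighted mean over methods; the source does not spell out either definition, so both are assumptions. The function names, model and method labels, and outcomes are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of attempts judged successful (assumed definition of ASR)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("ASR is undefined for an empty set of outcomes")
    return sum(outcomes) / len(outcomes)


def average_asr_per_model(
    records: Iterable[Tuple[str, str, bool]],
) -> Dict[str, float]:
    """Average ASR per model from (model, method, success) records.

    ASR is first computed per (model, method) pair, then averaged over
    methods with equal weight (an assumption about the averaging scheme).
    """
    per_pair: Dict[Tuple[str, str], List[bool]] = defaultdict(list)
    for model, method, success in records:
        per_pair[(model, method)].append(success)

    per_model: Dict[str, List[float]] = defaultdict(list)
    for (model, _method), outcomes in per_pair.items():
        per_model[model].append(attack_success_rate(outcomes))

    return {m: sum(v) / len(v) for m, v in per_model.items()}


# Made-up records for illustration only.
records = [
    ("model-a", "method-1", True),
    ("model-a", "method-1", False),
    ("model-a", "method-2", True),
    ("model-a", "method-2", True),
]
report = average_asr_per_model(records)
print(report)  # {'model-a': 0.75}
```

Averaging per method first keeps a method with many queries from dominating the model-level figure; pooling all outcomes instead would be an equally plausible reading and would give different numbers when methods are evaluated on different query counts.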