Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Zhao Xu; Fan Liu; Hao Liu

TL;DR

This work evaluates the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives and highlights the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs.

Abstract

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work largely overlooks the diverse key factors of jailbreak attacks: most studies concentrate on LLM vulnerabilities and do not explore defense-enhanced LLMs. To address these issues, we introduce JailTrickBench to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.

Paper Structure (33 sections, 9 equations, 17 figures, 7 tables)

Introduction
Related Work
Preliminaries
Large Language Model
Threat Model
Jailbreak Defense
Benchmarking of Jailbreak Attacks on LLMs
Experiment Setup
Bag of Tricks: Key Factors of Implementation Details of Jailbreak Attack
Target Model Level
Attacker Level
Benchmark of Jailbreak Attack on Defense Methods
Conclusion
Acknowledgments
Experiments Setup
...and 18 more sections

Figures (17)

Figure 1: An overview of benchmarking jailbreak attacks on LLMs.
Figure 2: Effect of model size on jailbreak attack performance. We assess two types of jailbreak attacks: token-level (AutoDAN) and prompt-level (PAIR), across the Llama2, Llama3, and Qwen1.5 families with various model sizes. Our analysis indicates that robustness to jailbreak attacks does not depend on model size.
Figure 3: Effect of fine-tuning alignment on the robustness of LLMs. We assess the robustness of LLMs before and after fine-tuning by subjecting them to two distinct types of jailbreak attacks (AutoDAN and PAIR) across various configurations. The Tulu, Vicuna, and WizardLM models are fine-tuned versions of the Llama2 series. Additionally, we investigate the impact of fine-tuning alignment on Mistral-7B and Qwen1.5-14B, which are fine-tuned using the Tulu v2 SFT dataset, consisting of 326,154 samples. We observe that fine-tuning significantly compromises the models' safety alignment.
Figure 4: Impact of safety system prompts (ssp) on the robustness of LLMs. We evaluate the effect of safety system prompts on the performance of LLMs under token-level and prompt-level attacks. The evaluation is conducted using five LLMs, including Llama-2-7B, Llama-3-8B, Vicuna-7B, Qwen1.5-7B, and Mistral-7B. Our observations indicate that safety system prompts significantly enhance the robustness of LLMs.
Figure 5: Effect of template type on the robustness of LLMs. We evaluate whether the template type (original default template or zero-shot template) affects the robustness of LLMs. The evaluation is conducted using five LLMs, including Llama-2-7B, Llama-3-8B, Vicuna-7B, Qwen1.5-7B, and Mistral-7B. Our findings indicate that the original default template enhances the robustness of LLMs, whereas the zero-shot template compromises the LLMs' safety alignment.
...and 12 more figures
