Table of Contents
Fetching ...

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Zhao Xu, Fan Liu, Hao Liu

TL;DR

This work evaluates the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives and highlights the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs.

Abstract

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduced $\textbf{JailTrickBench}$ to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

TL;DR

This work evaluates the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives and highlights the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs.

Abstract

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduced to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.
Paper Structure (33 sections, 9 equations, 17 figures, 7 tables)

This paper contains 33 sections, 9 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: An overview of benchmarking jailbreak attacks on LLMs.
  • Figure 2: Effect of model size on jailbreak attack performance. We assess two types of jailbreak attacks: token-level (AutoDAN) and prompt-level (PAIR), across the Llama2, Llama3, and Qwen1.5 families with various model sizes. Our analysis indicates that attack robustness does not depend on model size.
  • Figure 3: Effect of finetuning alignment on the robustness of LLMs. We assess the robustness of LLMs before and after fine-tuning by subjecting them to two distinct types of jailbreak attacks—AutoDAN and PAIR—across various configurations. Tulu, Vicuna, and WizardLM models represent the post-fine-tuning versions of Llama2 series. Additionally, we investigate the impact of fine-tuning alignment on Mistral-7B and Qwen1.5-14B, which are fine-tuned using the Tulu v2 SFT dataset, consisting of 326,154 samples. It was observed that fine-tuning significantly compromised the models' safety alignment.
  • Figure 4: Impact of safety system prompts (ssp) on the robustness of LLMs. We evaluate the effect of safety system prompts on the performance of LLMs under token-level and prompt-level attacks. The evaluation is conducted using five LLMs, including Llama-2-7B, Llama-3-8B, Vicuna-7B, Qwen1.5-7B, and Mistral-7B. Our observations indicate that safety system prompts significantly enhance the robustness of LLMs.
  • Figure 5: Effect of template type on the robustness of LLMs. We evaluated whether the template type (original default template and zero-shot template) affects the robustness of LLMs. The evaluation is conducted using five LLMs, including Llama-2-7B, Llama-3-8B, Vicuna-7B, Qwen1.5-7B, and Mistral-7B. Our findings indicate that the original default template enhances the robustness of LLMs, whereas the use of the zero-shot template compromises the LLMs' safety alignment.
  • ...and 12 more figures