AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Dong Shu, Chong Zhang, Mingyu Jin, Zihao Zhou, Lingyao Li, Yongfeng Zhang

TL;DR

The paper introduces two complementary frameworks for evaluating how effective jailbreak prompts are against large language models (LLMs): a coarse-grained metric that scores a prompt across multiple models, and a fine-grained analysis that scores individual prompt-response pairs. It also provides a ground-truth jailbreak dataset spanning diverse scenarios, so that responses can be benchmarked beyond a binary success/failure outcome. Empirically, the authors show that their metrics align with traditional binary baselines while revealing nuanced risks and scenario-specific vulnerabilities, notably in Political Lobbying prompts. The work further details weighting schemes for multi-model systems and includes appendices with judgment prompts and evaluation protocols, offering a practical toolkit for jailbreak assessment and defense-oriented research.

Abstract

Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs). To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the attacking prompts' effectiveness. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset is a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in prompt injection.
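
To make the contrast between a binary evaluation and a score in the 0-to-1 range concrete, here is a minimal sketch. It is not the paper's formula, which this summary does not reproduce. It assumes a hypothetical setup in which each model's response to one attack prompt has already been judged as jailbroken or not, and each model carries an illustrative weight. The names `binary_baseline`, `coarse_grained_score`, and the model names and weights are all assumptions made for illustration.

```python
from typing import Mapping


def binary_baseline(outcomes: Mapping[str, bool]) -> float:
    """Hypothetical binary baseline: 1.0 if any model was jailbroken, else 0.0."""
    return 1.0 if any(outcomes.values()) else 0.0


def coarse_grained_score(
    outcomes: Mapping[str, bool], weights: Mapping[str, float]
) -> float:
    """Weighted fraction of models jailbroken by one prompt, in [0, 1].

    Assumption: weights are non-negative and are normalized here, so they
    do not need to sum to 1 in advance.
    """
    total = sum(weights[model] for model in outcomes)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    hit = sum(weights[model] for model, broken in outcomes.items() if broken)
    return hit / total


# Illustrative values only; these are not results from the paper.
outcomes = {"model_a": True, "model_b": False, "model_c": True}
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

print(binary_baseline(outcomes))
print(coarse_grained_score(outcomes, weights))
```

Under these assumptions, the binary baseline collapses the prompt to a single success flag, whereas the weighted score distinguishes a prompt that breaks most of the weighted models from one that breaks only a lightly weighted model.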

Paper Structure

This paper contains 32 sections, 8 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: The comparison between the coarse-grained metric and the binary baseline. The baseline is depicted as bars at scores 0 and 1. Our metric is represented by the line. The red line superimposed on the bar graph visualizes the aggregated percentages before and after the demarcation point.
  • Figure 2: Results of fine-grained evaluation metric with ground truth. The vertical axis indicates the percentage of attack prompts, while the horizontal axis depicts the range of scores. Each figure compares our fine-grained metric with ground truth and the binary baseline metric on a specific model. The baseline is depicted as bars at scores 0 and 1, while our metric is represented by the line. The red line superimposed on the bar graph visualizes the aggregated percentages before and after the demarcation point.
  • Figure 3: Results of fine-grained evaluation metric without ground truth.
  • Figure 4: The correlation between our evaluation metrics and the baseline. Each color represents one of our evaluation metrics. The x-axis shows the prompts that correspond to our evaluation scores, while the y-axis represents the prompts associated with the baseline scores.