AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Dong Shu; Chong Zhang; Mingyu Jin; Zihao Zhou; Lingyao Li; Yongfeng Zhang

TL;DR

The paper introduces two complementary frameworks for evaluating the effectiveness of jailbreak prompts on large language models: a coarse-grained cross-model metric and a fine-grained per-prompt analysis. It also provides a ground-truth jailbreak dataset spanning diverse scenarios, so that responses can be benchmarked beyond binary outcomes. Empirically, the authors show that their metrics align with the traditional baselines while revealing nuanced risks and scenario-specific vulnerabilities, notably in Political Lobbying prompts. The paper further details weighting schemes for multi-model systems and includes appendices with judgment prompts and evaluation protocols, which together serve as a practical toolkit for jailbreak assessment and defense-oriented research.

Abstract

Jailbreak attacks represent one of the most sophisticated threats to the security of large language models (LLMs). To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the attacking prompts' effectiveness. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset is a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in prompt injection.
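To make the idea of a coarse-grained score in the range 0 to 1 concrete, the following is a minimal sketch, not the authors' formula. It assumes each model yields a binary jailbreak outcome for a given prompt and that each model has a non-negative weight; the function name `coarse_grained_score`, the model names, and the weights are all hypothetical.

```python
from typing import Mapping


def coarse_grained_score(
    outcomes: Mapping[str, bool],
    weights: Mapping[str, float],
) -> float:
    """Weighted share of models on which one attack prompt succeeded.

    Assumption (not taken from the paper): the score is a normalized
    weighted average of binary per-model outcomes, so it lies in [0, 1].
    """
    if set(outcomes) != set(weights):
        raise ValueError("outcomes and weights must cover the same models")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Sum the weights of the models that were jailbroken, then normalize.
    succeeded = sum(weights[name] for name, ok in outcomes.items() if ok)
    return succeeded / total


# Hypothetical example: one prompt tested against three placeholder models.
example_outcomes = {"model_a": True, "model_b": False, "model_c": True}
example_weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
example_score = coarse_grained_score(example_outcomes, example_weights)
```

A binary baseline would only report whether a prompt succeeded on a single model; a weighted average of this kind distinguishes a prompt that breaks one lightly weighted model from one that breaks most of the system.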

Paper Structure (32 sections, 8 equations, 4 figures, 5 tables)

Introduction
Related Work
Large Language Model's Vulnerability
Jailbreak Attack on Large Language Models
Methodology
Coarse-grained Evaluation Metric
Fine-grained Evaluation Metric
Fine-grained Evaluation Metric with Ground Truth
Fine-grained Evaluation Metric without Ground Truth
Experiment and Results
Experiment Settings
Task Description
Dataset Description
Baseline Evaluation Metric
Graph Structure
...and 17 more sections

Figures (4)

Figure 1: The comparison between the coarse-grained metric and the binary baseline. The baseline is depicted as bars at scores 0 and 1. Our metric is represented by the line. The red line superimposed on the bar graph visualizes the aggregated percentages before and after the demarcation point.
Figure 2: Results of fine-grained evaluation metric with ground truth. The vertical axis indicates the percentage of attack prompts, while the horizontal axis depicts the range of scores. Each figure compares our fine-grained metric with ground truth and the binary baseline metric on a specific model. The baseline is depicted as bars at scores 0 and 1, while our metric is represented by the line. The red line superimposed on the bar graph visualizes the aggregated percentages before and after the demarcation point.
Figure 3: Results of fine-grained evaluation metric without ground truth.
Figure 4: The correlation between our evaluation metrics and the baseline. Each color represents one of our evaluation metrics. The x-axis shows the prompts that correspond to our evaluation scores, while the y-axis represents the prompts associated with the baseline scores.
