Rethinking How to Evaluate Language Model Jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik

TL;DR

This work argues that current jailbreak evaluations for language models are ill-suited because their objectives are unclear and they reduce every attempt to a binary success-or-failure outcome. It defines three metrics (safeguard violation, informativeness, and relative truthfulness) and couples them with a response-preprocessing pipeline (hierarchical tokenization and invalid-segment exclusion) to form a multifaceted evaluation framework. Using a benchmark of 250 malicious intents drawn from three jailbreak systems and three intent datasets, labeled by three human annotators, the authors show that their approach (especially the combination-level method) yields higher F1 scores than existing string matching, NLU, and NLG baselines, with notable gains from response preprocessing. The study shows how this richer evaluation can better align safety assessments with attacker goals and support safer deployment of LLMs, and it outlines limitations and avenues for future work, including dataset expansion and further robustness enhancements.

Abstract

Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.
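To make the contrast with a single success/failure flag concrete, the following is a minimal sketch of how a three-metric label and a per-metric F1 comparison could be represented. It is not the authors' implementation: the `JailbreakLabel` class, the treatment of each metric as a boolean, and the majority-vote aggregation of the three annotators are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Metric names taken from the abstract; modelling each as a boolean is an assumption.
METRICS = ("safeguard_violation", "informativeness", "relative_truthfulness")


@dataclass(frozen=True)
class JailbreakLabel:
    """Hypothetical container: one flag per metric instead of one success bit."""
    safeguard_violation: bool
    informativeness: bool
    relative_truthfulness: bool


def majority_vote(annotations: List[JailbreakLabel]) -> JailbreakLabel:
    """Combine annotator labels metric by metric (assumed aggregation rule)."""
    def vote(metric: str) -> bool:
        positives = sum(getattr(a, metric) for a in annotations)
        return positives * 2 > len(annotations)

    return JailbreakLabel(*(vote(m) for m in METRICS))


def f1_score(predicted: List[bool], reference: List[bool]) -> float:
    """Standard binary F1; returns 0.0 when there are no true positives."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(r and not p for p, r in zip(predicted, reference))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_metric_f1(
    predictions: List[JailbreakLabel], references: List[JailbreakLabel]
) -> Dict[str, float]:
    """Score an evaluation method against reference labels, one F1 per metric."""
    return {
        m: f1_score(
            [getattr(p, m) for p in predictions],
            [getattr(r, m) for r in references],
        )
        for m in METRICS
    }
```

Under these assumptions, an evaluation method is scored by passing its predicted labels and the aggregated annotator labels to `per_metric_f1`, which reports each metric separately rather than collapsing them into one number.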

Paper Structure (36 sections, 21 figures, 5 tables)

This paper contains 36 sections, 21 figures, 5 tables.

Figures (21)

  • Figure 1: The process of language model jailbreak.
  • Figure 1: System prompt template for the NLG evaluation method (Chao et al., 2023). {Intent Content} is replaced with the intent.
  • Figure 2: An example of a jailbreak, where an adversarial prompt is used to attack the language model to answer questions which the language model originally refused.
  • Figure 2: Prompt template for the multifaceted evaluation method on SV. {Response Segment Content} is replaced with the response segment.
  • Figure 3: Response incorrectly labeled as a failed jailbreak by the SM approach due to detecting deny list words.
  • ...and 16 more figures