SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan

TL;DR

SceneJailEval addresses the lack of scenario-aware, multi-dimensional jailbreak evaluation by introducing a scenario-adaptive framework that selects relevant dimensions and weights per scenario. It combines a Scenario Classifier, Scenario-Dim Adapter, Jailbreak Detector, and Harmfulness Evaluator to produce both a jailbreak status and a nuanced harm score, with weights derived via Delphi and AHP. The authors provide a 14-scenario benchmark and demonstrate state-of-the-art performance (F1 up to $0.917$ on their dataset and $0.995$ on JBB) across heterogeneous contexts, while also validating extensibility through a customized scenario. The work offers a principled, extensible approach for comprehensive LLM security assessment, with strong implications for benchmarking, governance, and defense against diverse jailbreak threats.
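The weighting step mentioned above can be illustrated with a minimal sketch. It assumes the row geometric-mean approximation of AHP priorities and a plain weighted sum for the harm score; the source does not state which AHP variant or aggregation the authors use. The dimension names, pairwise judgments, and the `ahp_weights` and `harm_score` helpers are hypothetical, and the Delphi expert-consensus round that would produce the pairwise matrix is not modeled.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights via the row geometric-mean method.

    pairwise[i][j] states how much more important dimension i is than
    dimension j, so pairwise[j][i] should equal 1 / pairwise[i][j].
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def harm_score(dim_scores, weights):
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(weights[d] * dim_scores[d] for d in weights)


# Hypothetical dimensions and pairwise judgments for a single scenario.
dims = ["specificity", "severity", "scope"]
pairwise = [
    [1.0, 0.5, 2.0],   # specificity compared with each dimension
    [2.0, 1.0, 4.0],   # severity compared with each dimension
    [0.5, 0.25, 1.0],  # scope compared with each dimension
]
weights = dict(zip(dims, ahp_weights(pairwise)))

# Hypothetical per-dimension ratings of one model response.
score = harm_score({"specificity": 0.5, "severity": 1.0, "scope": 0.0}, weights)
```

Under this sketch, a scenario-adaptive evaluator would keep one such weight table per scenario and look it up after classifying the query, so a dimension that is irrelevant to a scenario is simply absent from that scenario's table.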

Abstract

Accurate jailbreak evaluation is critical for LLM red team testing and jailbreak research. Mainstream methods rely on binary classification (string matching, toxic text classifiers, and LLM-based methods), outputting only "yes/no" labels without quantifying harm severity. Emerging multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, and Informativeness) apply unified evaluation standards across scenarios, leading to scenario-specific mismatches (e.g., "Relative Truthfulness" is irrelevant to "hate speech") that undermine evaluation accuracy. To address these issues, we propose SceneJailEval, with key contributions: (1) A pioneering scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical "one-size-fits-all" limitation of existing multi-dimensional methods, and boasting robust extensibility to seamlessly adapt to customized or emerging scenarios. (2) A novel 14-scenario dataset featuring rich jailbreak variants and regional cases, addressing the long-standing gap in high-quality, comprehensive benchmarks for scenario-adaptive evaluation. (3) SceneJailEval delivers state-of-the-art performance with an F1 score of 0.917 on our full-scenario dataset (+6% over SOTA) and 0.995 on JBB (+3% over SOTA), breaking through the accuracy bottleneck of existing evaluation methods in heterogeneous scenarios and solidifying its superiority.

Paper Structure

This paper contains 23 sections, 12 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Overview of SceneJailEval, including dataset construction and evaluation framework.

Theorems & Definitions (2)

  • Definition 1: Jailbreak Attack
  • Definition 2: Jailbreak Evaluation