JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu

TL;DR

JAILJUDGE tackles the challenge of reliably evaluating LLM safety against jailbreak attacks by introducing a comprehensive benchmark and a multi-agent, explainability-focused evaluation framework. It combines diverse, high-quality data (35k+ instruction-tuning items, a 4.5k+ labeled test set of risk scenarios, and a 6k+ multilingual test set covering ten languages) with a JailJudge MultiAgent system that produces interpretable, fine-grained scores (1–10) using evidence-theoretic fusion. The authors also present JAILJUDGE Guard, an end-to-end jailbreak judge model trained on the benchmark, and two methods built on it: JailBoost for attack enhancement and GuardShield for defense. Empirical results show state-of-the-art jailbreak-judgment performance, a 29.24% gain in attack performance with JailBoost, and a drop in defense attack success rate (ASR) from 40.46% to 0.15% with GuardShield under zero-shot settings, underscoring the framework’s practical value for safety evaluation, policy enforcement, and cost-efficient moderation.
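
The summary above does not say which fusion rule the multi-agent judge uses. As an illustration only, the sketch below applies Dempster's rule of combination, a standard operator in evidence theory, to two judging agents. The two-label frame, the example masses, and the `to_score` mapping onto the 1–10 scale are all assumptions made for this example, not details taken from the paper.

```python
from itertools import product

# Frame of discernment for a jailbreak judge (labels are illustrative).
JAILBROKEN = frozenset({"jailbroken"})
SAFE = frozenset({"safe"})
EITHER = JAILBROKEN | SAFE  # mass on the full frame expresses uncertainty


def dempster_combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination."""
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            # Agreeing evidence accumulates on the intersection.
            fused[overlap] = fused.get(overlap, 0.0) + mass_a * mass_b
        else:
            # Disjoint hypotheses contribute to the conflict mass.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {hypothesis: mass / norm for hypothesis, mass in fused.items()}


def to_score(mass, low=1, high=10):
    """Hypothetical mapping from belief in 'jailbroken' to a 1-10 score."""
    belief = mass.get(JAILBROKEN, 0.0)
    return low + round((high - low) * belief)


if __name__ == "__main__":
    # Made-up masses from two judging agents.
    judge_a = {JAILBROKEN: 0.6, SAFE: 0.1, EITHER: 0.3}
    judge_b = {JAILBROKEN: 0.7, SAFE: 0.2, EITHER: 0.1}
    fused = dempster_combine(judge_a, judge_b)
    print({"/".join(sorted(k)): round(v, 3) for k, v in fused.items()})
    print("score:", to_score(fused))
```

Keeping some mass on the full frame (`EITHER`) lets an agent express uncertainty without committing to a label, which is the main reason to prefer this kind of fusion over simple score averaging.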

Abstract

Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge. Current methods often lack explainability and fail to generalize to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes more than 35k instruction-tuning samples with reasoning explanations; JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios; and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost improving attack performance by 29.24% and GuardShield reducing defense attack success rate (ASR) from 40.46% to 0.15%.
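
To make the ASR figures concrete, the sketch below shows one way an attack success rate could be derived from 1–10 judge scores, and how a judge-based moderation step might gate a response. The function names and the cut-off `threshold=5` are hypothetical choices for this example; the abstract does not state the threshold or the GuardShield implementation.

```python
def attack_success_rate(judge_scores, threshold=5):
    """Percentage of attempts whose judge score meets the (assumed) threshold."""
    if not judge_scores:
        raise ValueError("need at least one judged attempt")
    successes = sum(1 for score in judge_scores if score >= threshold)
    return 100.0 * successes / len(judge_scores)


def moderate(response, judge_score, threshold=5,
             refusal="Sorry, I can't help with that."):
    """Hypothetical moderation gate: swap flagged responses for a refusal."""
    return refusal if judge_score >= threshold else response


def relative_reduction(before_pct, after_pct):
    """Relative drop between two percentages, itself expressed in percent."""
    return 100.0 * (before_pct - after_pct) / before_pct


if __name__ == "__main__":
    scores = [1, 1, 9, 10]  # made-up judge scores for four attack attempts
    print("ASR %:", attack_success_rate(scores))
    print("relative reduction %:", round(relative_reduction(40.46, 0.15), 2))
```

Read this way, the reported change from 40.46% to 0.15% is a drop of 40.31 percentage points, or roughly a 99.6% relative reduction in successful attacks.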

Paper Structure

This paper contains 32 sections, 10 equations, 23 figures, 5 tables, and 2 algorithms.

Figures (23)

  • Figure 1: JAILJUDGE Benchmark and Multi-agent Judge Framework
  • Figure 2: F1 scores across ten different languages using our JailJudge MultiAgent.
  • Figure 3: Ablation study on datasets JAILJUDGE ID and JBB Behaviors.
  • Figure 4: Ablation study on datasets JAILJUDGE OOD and WILDTEST.
  • Figure 5: Experiments on JailBoost (ASR %, $\uparrow$).
  • ...and 18 more figures