JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu
TL;DR
JAILJUDGE tackles the challenge of reliably evaluating LLM safety against jailbreaking by introducing a comprehensive benchmark and a multi-agent, explainability-focused evaluation framework. It combines diverse, high-quality data (35k+ instruction-tuning items and 4.5k/6k labeled test sets across multilingual contexts) with a JailJudge MultiAgent system that produces interpretable, fine-grained scores (1–10) using evidence-theoretic fusion. The authors also present JAILJUDGE Guard, an end-to-end jailbreak judge model trained on the benchmark, and two enhancement methods—JailBoost and GuardShield—for attacker enhancement and defense, respectively. Empirical results demonstrate SOTA jailbreak-judgment performance and substantial improvements in defense effectiveness (ASR dropping to 0.15% in zero-shot tests) and attacker effectiveness with JailBoost, underscoring the framework’s practical value for safety evaluation, policy enforcement, and cost-efficient moderation.
Abstract
Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes over 35k+ instruction-tune data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.
