JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang
TL;DR
This work addresses the gap in safety evaluation for Chinese-language LLMs by introducing JailBench, a comprehensive benchmark built on a refined two-level taxonomy (5 domains, 40 risk types) tailored to Chinese contexts. It combines automated dataset expansion via in-context learning with the Automatic Jailbreak Prompt Engineer (AJPE), a framework that uses LLM-driven few-shot generation and log-probability scoring to produce high-quality jailbreak prompts at scale. The dataset comprises 10,800 jailbreak-enhanced queries, derived from over 10k seed questions, with 20 top prompts applied to 540 seeds (20 × 540 = 10,800). Evaluated across 13 mainstream LLMs, it achieves a high attack success rate (ASR) of $73.86\%$ against ChatGPT. The results reveal substantial safety gaps: newer models generally show improved safety alignment, yet larger models can be more vulnerable to jailbreaks. These findings underscore the need for robust safety testing and ongoing defense improvements in the Chinese LLM landscape. The benchmark is publicly available for reuse.
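The dataset arithmetic and the ASR metric mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the placeholder templates and seeds, and the definition of ASR as the percentage of queries whose response is judged unsafe are all assumptions made for illustration.

```python
from itertools import product


def build_queries(templates, seeds):
    """Apply every template to every seed question (Cartesian product).

    Hypothetical helper: each template is assumed to contain a
    `{question}` placeholder that is filled with the seed text.
    """
    return [t.format(question=s) for t, s in product(templates, seeds)]


def attack_success_rate(judgements):
    """Percentage of responses judged unsafe.

    Assumed definition: `judgements` holds one boolean per query,
    True when the model's response was judged unsafe.
    """
    if not judgements:
        raise ValueError("no judgements to score")
    return 100.0 * sum(judgements) / len(judgements)


# Placeholder data only; real templates and seeds are not reproduced here.
templates = [f"[template {i}] {{question}}" for i in range(20)]
seeds = [f"seed question {j}" for j in range(540)]

queries = build_queries(templates, seeds)
print(len(queries))  # 20 templates x 540 seeds
```

Under these assumptions, the combined set has one query per (template, seed) pair, which matches the 10,800 figure reported in the summary.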
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address this gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to improve assessment effectiveness and leverages LLMs to automatically scale up the dataset through in-context learning. JailBench is extensively evaluated on 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT among existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs and illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at https://github.com/STAIR-BUPT/JailBench.
