JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models

Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang

TL;DR

This work addresses the gap in safety evaluation for Chinese-language LLMs by introducing JailBench, a benchmark built on a refined two-level taxonomy (5 risk domains, 40 risk types) tailored to Chinese contexts. Its construction combines automated dataset expansion via context-learning with the Automatic Jailbreak Prompt Engineer (AJPE), a framework that uses LLM-driven few-shot generation and log-probability scoring to produce high-quality jailbreak prompts at scale. The dataset comprises 10,800 jailbreak-enhanced queries: from a pool of over 10k seed questions, 540 seeds are each combined with the 20 top prompts ($540 \times 20 = 10{,}800$). Evaluated across 13 mainstream LLMs, JailBench reaches an attack success rate (ASR) of $73.86\%$ against ChatGPT. The results reveal substantial safety gaps: newer models generally show better safety alignment, yet larger models can be more vulnerable to jailbreaks. This underscores the need for robust safety testing and ongoing defense improvements in the Chinese LLM landscape. The benchmark is publicly available for broad reuse.
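The dataset size follows from pairing every selected seed with every top prompt. The sketch below illustrates only that counting step. The function name `build_jailbreak_queries`, the `{question}` slot, and the placeholder strings are assumptions made for illustration; the actual seeds and prompts are those in the JailBench repository, and the way AJPE combines them may differ.

```python
from itertools import product


def build_jailbreak_queries(seeds, templates, slot="{question}"):
    """Pair every seed with every template (a Cartesian product).

    Hypothetical helper: the slot-substitution scheme is an assumption,
    not the combination method described by the authors.
    """
    return [t.replace(slot, s) for s, t in product(seeds, templates)]


# Placeholder data standing in for the 540 seeds and 20 top prompts.
seeds = [f"seed question {i}" for i in range(540)]
templates = [f"template {j}: {{question}}" for j in range(20)]

queries = build_jailbreak_queries(seeds, templates)
print(len(queries))  # 10800
```

With 540 seeds and 20 templates the product yields 10,800 distinct queries, matching the size reported above.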

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address this gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessment effectiveness and leverages LLMs to automatically scale up the dataset through context-learning. The proposed JailBench is extensively evaluated on 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT among existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs and illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at https://github.com/STAIR-BUPT/JailBench.

Paper Structure

This paper contains 21 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of existing Chinese safety assessment benchmarks and our JailBench. Here we use the Attack Success Rate (ASR) of each benchmark on ChatGPT as a measure of its effectiveness for LLM safety evaluation.
  • Figure 2: JailBench’s taxonomy with two levels consisting of 5 risk domains and 40 specific categories.
  • Figure 3: Instructions for the automatic labelling task. The instruction consists of three main parts: task description, categorisation criteria, and output format restrictions.
  • Figure 4: Prompts for unsafe question augmentation task. The instruction can be divided into three main parts: explicit generation of tasks and target topics, reference examples, and output formatting restrictions.
  • Figure 5: Flowchart of the AJPE process.
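
Figure 1 compares benchmarks by ASR. As a minimal sketch, ASR can be read as the fraction of benchmark queries whose responses are judged unsafe. This page does not say how responses are judged, so the boolean labels and the name `attack_success_rate` below are assumptions for illustration, not the authors' evaluation pipeline.

```python
def attack_success_rate(judgements):
    """Return the fraction of responses judged unsafe.

    `judgements` is a sequence of booleans, True meaning the attack
    succeeded. How each label is produced (human or model judge) is
    outside this sketch.
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(1 for j in judgements if j) / len(judgements)


# Made-up labels for illustration only; not results from the paper.
labels = [True, True, False, True]
print(f"ASR = {attack_success_rate(labels):.2%}")  # ASR = 75.00%
```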