ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models

Hengxiang Zhang, Hongfu Gao, Qiang Hu, Guanhua Chen, Lili Yang, Bingyi Jing, Hongxin Wei, Bing Wang, Haifeng Bai, Lei Yang

TL;DR

ChineseSafe tackles the gap in Chinese-language safety assessment for LLMs by introducing a large-scale, China-aligned benchmark with 205,034 examples across 4 safety classes and 10 sub-classes, including political sensitivity, pornography, and variant words. It employs both generation- and perplexity-based evaluation across 26 diverse LLMs, revealing that safety performance is not strictly tied to model size and that generation-based evaluation more effectively detects unsafe content. The paper provides a data collection and processing pipeline, a balanced test setup, and detailed results across categories, offering practical guidance for developing safer Chinese LLMs and informing content moderation. The dataset and evaluation code are publicly released, enabling broader adoption and benchmarking in real-world Chinese scenarios.

Abstract

With the rapid development of large language models (LLMs), understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risk of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of large language models. To align with the regulations for Chinese Internet content moderation, ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-source models and APIs. The results reveal that many LLMs are vulnerable to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to improve the safety of LLMs. Our results are also available at https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark. Additionally, we release a test set comprising 200,000 examples, which is publicly accessible at https://huggingface.co/datasets/SUSTech/ChineseSafe.

Paper Structure

This paper contains 24 sections, 1 figure, and 9 tables.

Figures (1)

  • Figure 1: ChineseSafe includes 4 classes and 10 sub-classes of safety issues.