A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang; Zenan Zhai; Haonan Li; Xudong Han; Lizhi Lin; Zhenxuan Zhang; Jingru Zhao; Preslav Nakov; Timothy Baldwin

TL;DR

This work addresses the English-centric bias in LLM safety evaluation by introducing a Chinese safety dataset that captures region-specific risks. It translates and localizes the Do-not-answer dataset into Mandarin, expands it with region-specific prompts, and provides three attack perspectives, a six-category risk taxonomy, and 17 harm types, totaling 3,042 prompts and about 15k model responses from five LLMs. The authors propose fine-grained manual and automatic evaluation guidelines and demonstrate, through extensive experiments and GPT-4-based scoring, that region-specific risks dominate unsafe responses, with notable differences across Chinese- versus English-trained models. The dataset is open-source and intended to advance safety assessment and alignment for Chinese and multilingual LLMs, guiding future data augmentation and automated risk-detection methods.

Abstract

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.

Paper Structure (21 sections, 5 figures, 10 tables)

Introduction
Related Work
Assessing Particular Types of Risk
Prompt Engineering for Jailbreaking
Multilingual Risk Evaluation of LLMs
Dataset
Experiments
LLM Response Collection
Harmfulness Evaluation
Evaluation Strategy
Automatic Assessment Using GPT-4
Safety Rank
Risk Category
Question Type
Sensitivity Evaluation
...and 6 more sections

Figures (5)

Figure 1: Number of harmful responses for five different Chinese LLMs. LLaMA2, as an English-centric model, is the safest LLM when tested with English direct questions from the Do-not-answer dataset, but the least safe when evaluated with our Chinese-centric questions.
Figure 2: Harmful response distribution over the six risk areas. I = misinformation harms, II = human-chatbot interaction harms, III = malicious uses, IV = discrimination, exclusion, toxicity, hateful, offensive, V = information hazards, and VI = region/religion-specific sensitive topics.
Figure 3: Harmful response distribution over three types of questions: direct attack, indirect attack, and harmless questions with risk-sensitive words/phrases.
Figure 4: The distribution of response patterns across the five Chinese LLMs.
Figure 5: Confusion matrix of GPT-4 evaluation against human annotation as the gold standard. GPT-4 identifies the majority of safe responses correctly but performs no better than random guessing on harmful responses. For action classification, responses in categories 3 and 4 tend to be classified as 5 by GPT-4, implying that human annotators make more fine-grained distinctions between response patterns than GPT-4 does.
