CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs

Zhihao Liu, Chenhui Hu

TL;DR

CFSafety is a safety assessment benchmark that combines 5 classic safety scenarios and 5 types of instruction attacks, giving 10 categories of safety questions and a test set of 25k prompts. Of the eight LLMs tested, GPT-4 showed the strongest safety performance, but the safety of all models, GPT-4 included, still needs improvement.

Abstract

As large language models (LLMs) rapidly evolve, they bring significant conveniences to our work and daily lives, but also introduce considerable safety risks. These models can generate texts with social biases or unethical content, and under specific adversarial instructions, may even incite illegal activities. Therefore, rigorous safety assessments of LLMs are crucial. In this work, we introduce a safety assessment benchmark, CFSafety, which integrates 5 classic safety scenarios and 5 types of instruction attacks, totaling 10 categories of safety questions, to form a test set with 25k prompts. This test set was used to evaluate the natural language generation (NLG) capabilities of LLMs, employing a combination of simple moral judgment and a 1-5 safety rating scale for scoring. Using this benchmark, we tested eight popular LLMs, including the GPT series. The results indicate that while GPT-4 demonstrated superior safety performance, the safety effectiveness of LLMs, including this model, still requires improvement. The data and code associated with this study are available on GitHub.

Paper Structure

This paper contains 10 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Ten categories of LLM safety issues and their detailed explanations. These safety problems include, but are not limited to, social bias, criminal and unethical content, and data leakage. Each category is described together with its potential risks and methods of attack.
  • Figure 2: Performance of eight popular large language models (LLMs) under the CFSafety safety assessment framework. A radar chart shows each model's scores across the 10 safety categories; the scores are then averaged and the models ranked accordingly.
  • Figure 3: The CFSafety framework. We concatenate test questions, LLM responses, and safety issue category templates, and feed them into the evaluation LLM. This process yields an initial moral judgment and the output-token probabilities for the safety ratings 1 to 5, which are used to weight those ratings by their likelihood. We combine these two aspects to derive the final CFSafety score.
  • Figure 4: Radar chart of the performance scores of eight popular large language models under the CFSafety framework for 10 types of safety issues.
  • Figure 5: Ridge plot of the distribution of 1-5 scores for ten safety issues in eight popular large language models under the CFSafety framework.
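
The probability-weighted rating described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the evaluation LLM exposes log-probabilities for the rating tokens "1" to "5"; the function name `expected_rating` and the input layout are hypothetical, not taken from the paper's code. The rule that combines this rating with the moral judgment is not stated in this summary, so it is left out.

```python
import math

RATING_TOKENS = ("1", "2", "3", "4", "5")

def expected_rating(token_logprobs):
    """Return the probability-weighted mean of the ratings 1-5.

    token_logprobs: dict mapping a candidate output token to its
    log-probability (hypothetical layout, assumed for illustration).
    Tokens other than "1".."5" are ignored, and the remaining
    probabilities are renormalised so they sum to 1.
    """
    probs = {
        int(tok): math.exp(lp)
        for tok, lp in token_logprobs.items()
        if tok in RATING_TOKENS
    }
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no rating token among the candidates")
    return sum(rating * p for rating, p in probs.items()) / total
```

For example, an evaluator that splits its probability mass evenly between "1" and "5" would receive a weighted rating midway between the two, rather than being forced to either extreme as it would with a single sampled token.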