S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models

Xiaohan Yuan; Jinfeng Li; Dongxia Wang; Yuefeng Chen; Xiaofeng Mao; Longtao Huang; Jialuo Chen; Hui Xue; Xiaoxia Liu; Wenhai Wang; Kui Ren; Jingyi Wang

TL;DR

This work tackles the challenge of comprehensively evaluating LLM safety by introducing S-Eval, a framework that unifies risk taxonomy with automated test generation and a safety critique LLM. It defines an 8-dimension, 102-risk taxonomy and constructs a large, bilingual safety benchmark of 220,000 test cases to drive automated testing. The approach combines an expert testing LLM for test generation with a safety critique LLM that provides quantitative scores and explanations, validated across 21 LLMs and multilingual settings. Findings show S-Eval yields more discriminative safety assessments than existing benchmarks and illuminate how model scale, language, and decoding parameters influence safety, with an open benchmark to foster safer LLM deployment.

Abstract

Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is critical to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the vastness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy that systematically reflects LLM content safety, as well as automated safety assessment techniques that explore potential risks efficiently. To bridge this gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components: an expert testing LLM ${M}_t$ and a novel safety critique LLM ${M}_c$. ${M}_t$ is responsible for automatically generating test cases in accordance with the proposed risk taxonomy. ${M}_c$ provides quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval is efficient and effective in both test generation and safety evaluation. Moreover, thanks to its LLM-based architecture, S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and the accompanying new safety threats, test generation methods, and safety critique methods. S-Eval has been deployed at our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios. Our benchmark is publicly available at https://github.com/IS2Lab/S-Eval.
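The two-component loop described in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: `evaluate`, `safe_fraction`, `Verdict`, and the three toy callables are hypothetical names, the callables stand in for ${M}_t$, the model under test, and ${M}_c$, and the aggregated fraction is a simple stand-in rather than the scoring used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Verdict:
    """Hypothetical critique output: a safety label plus an explanation."""
    safe: bool
    explanation: str


def evaluate(
    risks: Iterable[str],
    generate_tests: Callable[[str], List[str]],  # stands in for M_t
    target_model: Callable[[str], str],          # the LLM under test
    critique: Callable[[str, str], Verdict],     # stands in for M_c
) -> List[Verdict]:
    """For each risk, generate test prompts, query the target, and critique each response."""
    verdicts = []
    for risk in risks:
        for prompt in generate_tests(risk):
            response = target_model(prompt)
            verdicts.append(critique(prompt, response))
    return verdicts


def safe_fraction(verdicts: List[Verdict]) -> float:
    """Illustrative aggregate: share of responses judged safe (0.0 if there are none)."""
    if not verdicts:
        return 0.0
    return sum(v.safe for v in verdicts) / len(verdicts)


# Toy stand-ins (assumptions for illustration; real components would be LLM calls).
def toy_generator(risk: str) -> List[str]:
    return [f"Test prompt about {risk}"]


def toy_model(prompt: str) -> str:
    return "I can't help with that."


def toy_critique(prompt: str, response: str) -> Verdict:
    refused = "can't" in response
    return Verdict(safe=refused, explanation="refusal detected" if refused else "no refusal")


verdicts = evaluate(["risk A", "risk B"], toy_generator, toy_model, toy_critique)
score = safe_fraction(verdicts)
```

Passing the generator and critique in as plain callables mirrors the configurability the abstract claims: either component can be swapped without touching the loop.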

Paper Structure (29 sections, 3 equations, 9 figures, 8 tables, 1 algorithm)

Introduction
Preliminaries
Large Language Models
Problem Definition
The S-Eval Framework
Overview
Risk Management
Automatic Test Generation
Base Risk Prompt Generation
Attack Prompt Generation
High-quality Test Selection
Automatic Safety Evaluation
Experiments
Experimental Setup
Datasets and Models
...and 14 more sections

Figures (9)

Figure 1: Framework of S-Eval. "BRP" stands for base risk prompt and "AP" refers to attack prompt.
Figure 2: An example of automatic test generation.
Figure 3: An example of automatic safety evaluation.
Figure 4: The consistency and correlation analysis of different evaluation methods. (a) The horizontal axis represents the number of methods with the same evaluation result. (b) The horizontal and vertical axes represent the $SS$.
Figure 5: The safety score distributions on Chinese and English.
...and 4 more figures
