
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models

Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Jialuo Chen, Hui Xue, Xiaoxia Liu, Wenhai Wang, Kui Ren, Jingyi Wang

TL;DR

This work tackles the challenge of comprehensively evaluating LLM safety by introducing S-Eval, a framework that unifies a risk taxonomy with automated test generation and a safety critique LLM. It defines an 8-dimension, 102-risk taxonomy and constructs a large, bilingual safety benchmark of 220,000 test cases to drive automated testing. The approach combines an expert testing LLM for test generation with a safety critique LLM that provides quantitative scores and explanations, and it is validated across 21 LLMs and multilingual settings. Findings show that S-Eval yields more discriminative safety assessments than existing benchmarks and illuminate how model scale, language, and decoding parameters influence safety. The benchmark is openly released to foster safer LLM deployment.

Abstract

Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is critical to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the vastness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy that systematically reflects LLM content safety, as well as automated safety assessment techniques that explore potential risks efficiently. To bridge this gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components: an expert testing LLM ${M}_t$ and a novel safety critique LLM ${M}_c$. ${M}_t$ is responsible for automatically generating test cases in accordance with the proposed risk taxonomy. ${M}_c$ provides quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval is efficient and effective in test generation and safety evaluation. Moreover, thanks to its LLM-based architecture, S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and the accompanying new safety threats, test generation methods, and safety critique methods. S-Eval has been deployed at our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios. Our benchmark is publicly available at https://github.com/IS2Lab/S-Eval.
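
The two-component design described in the abstract can be pictured as a simple loop: a testing model produces prompts for each risk, the model under test answers them, and a critique model judges each answer. The Python sketch below is only an illustration under stated assumptions and is not the authors' implementation. The names `evaluate`, `Critique`, and the three toy functions are hypothetical, and the aggregate score (the fraction of responses judged safe) is an assumed definition, since the abstract does not specify one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    """Hypothetical output of the critique model: a verdict plus a reason."""
    safe: bool
    explanation: str


def evaluate(
    risks: List[str],
    generate_tests: Callable[[str], List[str]],   # stands in for the testing LLM M_t
    target_model: Callable[[str], str],           # the LLM under evaluation
    critique: Callable[[str, str], Critique],     # stands in for the critique LLM M_c
) -> float:
    """Return the fraction of responses judged safe (an assumed metric)."""
    verdicts: List[Critique] = []
    for risk in risks:
        for prompt in generate_tests(risk):
            response = target_model(prompt)
            verdicts.append(critique(prompt, response))
    if not verdicts:
        raise ValueError("no test cases were generated")
    return sum(v.safe for v in verdicts) / len(verdicts)


# Toy stand-ins so the sketch runs without any real model.
def toy_generate_tests(risk: str) -> List[str]:
    # One plain prompt and one adversarial variant per risk.
    return [f"base prompt about {risk}", f"attack prompt about {risk}"]


def toy_target(prompt: str) -> str:
    # Refuses plain prompts but complies with adversarial ones.
    if prompt.startswith("base"):
        return "I cannot help with that."
    return "Sure, here is how."


def toy_critique(prompt: str, response: str) -> Critique:
    safe = response.startswith("I cannot")
    return Critique(safe, "refusal detected" if safe else "model complied")


if __name__ == "__main__":
    score = evaluate(["risk A", "risk B"], toy_generate_tests, toy_target, toy_critique)
    print(f"safety score: {score:.2f}")
```

Passing the three components as plain callables reflects the configurability the abstract emphasizes: any of them could be swapped without changing the loop.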

Paper Structure

This paper contains 29 sections, 3 equations, 9 figures, 8 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Framework of S-Eval. "BRP" stands for base risk prompt and "AP" refers to attack prompt.
  • Figure 2: An example of automatic test generation.
  • Figure 3: An example of automatic safety evaluation.
  • Figure 4: The consistency and correlation analysis of different evaluation methods. (a) The horizontal axis represents the number of methods with the same evaluation result. (b) The horizontal and vertical axes represent the $SS$.
  • Figure 5: The safety score distributions for Chinese and English.
  • ...and 4 more figures