SAFETY-J: Evaluating Safety with Critique

Yixiu Liu; Yuxiang Zheng; Shijie Xia; Jiajun Li; Yi Tu; Chaoling Song; Pengfei Liu

TL;DR

SAFETY-J tackles the need for interpretable safety evaluation in LLM outputs by introducing a bilingual, critique-based safety evaluator. It combines a diverse training corpus, automated meta-evaluation of critique quality, and iterative safety preference learning via Direct Preference Optimization to continuously improve judgments. The approach yields superior performance on English and Chinese safety benchmarks, enhances critique quality, and enables practical uses like online correction and rule-based customization. The work is open-source and lays groundwork for scalable, explainable safety evaluation in multilingual LLM deployments, while outlining limitations and future directions such as handling multi-turn dialogues and integrating retrieval-augmented methods.

Abstract

The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we open-source SAFETY-J's training protocols, datasets, and code at https://github.com/GAIR-NLP/Safety-J.
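The TL;DR names Direct Preference Optimization (DPO) as the mechanism behind the iterative preference learning step. As a rough illustration only, the standard DPO loss for a single preferred/rejected critique pair can be sketched as below. The function name `dpo_loss`, its arguments, and the default `beta=0.1` are assumptions made for this sketch and are not taken from the SAFETY-J code; the inputs are summed token log-probabilities of each critique under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # critique over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically
    # stable way so large |x| does not overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss falls as the policy shifts probability toward the chosen critique.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0))
print(dpo_loss(-9.0, -5.0, -7.0, -7.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. Training then pushes the margin positive on pairs where one critique was judged better than the other.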

Paper Structure (41 sections, 9 figures, 10 tables, 1 algorithm)

Introduction
Related Work
LLM Safety
Critique-based Evaluation
Preference Learning for LLM
Safety-J
Training Data Curation
Collection
Critique Synthesis
Quality Inspect
Training
Meta-evaluation
Iterative Safety Preference Learning
Experiments
Training Setting
...and 26 more sections

Figures (9)

Figure 1: Safety scenarios covered by Safety-J.
Figure 2: An overview of our method.
Figure 3: Comparison of accuracy across different model versions and test sets. The graph depicts the accuracy performance of five Safety-J versions ($M_1$ to $M_5$) evaluated on various test sets: BeaverTails, DiaSafety, Jade, Flames, and WildSafety. Each line represents the accuracy trend for a specific test set across the model versions.
Figure 4: Performance comparison of Safety-J versions on English and Chinese meta-evaluation test sets. This figure displays the precision, recall, and F1 scores (Micro) for different versions ($M_1$ to $M_5$) of Safety-J on English and Chinese meta-evaluation test sets.
Figure 5: The safety rate of responses generated by Qwen under different conditions. The original bar represents the safety rate when Qwen generates responses directly. The ShieldLM, Safety-J ($M_1$), and Safety-J ($M_5$) bars indicate the safety rates when Qwen generates initial responses, which are then critiqued by the respective models (ShieldLM, Safety-J ($M_1$), Safety-J ($M_5$)) and subsequently revised by Qwen based on these critiques.
...and 4 more figures
