ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Zhexin Zhang; Yida Lu; Jingyuan Ma; Di Zhang; Rui Li; Pei Ke; Hao Sun; Lei Sha; Zhifang Sui; Hongning Wang; Minlie Huang

TL;DR

ShieldLM tackles the need for aligned, customizable, and explainable safety detectors for LLM outputs by building a bilingual, rule-aware dataset and training an LLM-based detector that can apply diverse safety standards without extensive instance-level labeling. It harnesses GPT-4 to generate explanations that accompany safety judgments, and introduces a mechanism to inject irrelevant rules during training to improve adaptability to new policies. Across multiple in-distribution and out-of-distribution test sets, ShieldLM achieves state-of-the-art performance, demonstrates strong customizability to different safety standards, and provides interpretable analyses that explain its decisions. The work also validates ShieldLM as a practical safety evaluator for assessing other LLMs, highlighting its potential to support safer deployment of conversational AI in real-world settings.

Abstract

The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but there is still no comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with common safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective as a safety evaluator for advanced LLMs. ShieldLM is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards.

Paper Structure (54 sections, 4 figures, 11 tables, 1 algorithm)

Introduction
Pilot Study
Method
Label Collection
Analysis Generation
Training & Inference
Experiments
Training Setting
Test Sets
Our Test Set
OOD Test Sets
Baselines
Moderation Tools
LLM+Prompt
LLM+Finetuning
...and 39 more sections

Figures (4)

Figure 1: ShieldLM achieves the best performance on both the F$_{1}$-Safe (S) and the F$_{1}$-Unsafe (U) score across 4 datasets. ShieldLM takes customized detection rules to support diverse application scenarios and safety standards, without requiring detailed instance-level annotations or manual prompt crafting, while also producing high-quality explanations.
Figure 2: An overview of our method. We first annotate the safety of various responses under different safety standards (rules) and then use GPT-4 to generate analyses that align with the human labels and rules. Finally, we train ShieldLM with the shown prompt. During training, we also incorporate a variety of irrelevant rules into the prompt to enhance ShieldLM's adaptability to multiple rules. The input for ShieldLM contains three parts: "[fixed template prompt] [custom rules] [the dialogue to be evaluated]", and the output for ShieldLM contains two parts: "[answer] [analysis]".
Figure 3: The influence of the hyperparameter $p$.
Figure 4: Some examples provided to annotators.
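
The input and output layout described in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the released ShieldLM code: `FIXED_TEMPLATE`, `build_prompt`, `parse_output`, the `Rules:` and `Dialogue:` labels, and the convention that the answer sits on the first output line are all hypothetical. The actual template wording and output markers are in the repository linked in the abstract.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical placeholder text; the paper's real template is not reproduced here.
FIXED_TEMPLATE = (
    "You are a safety detector. Judge whether the response in the dialogue "
    "below is safe, then explain your judgment."
)


def build_prompt(
    query: str,
    response: str,
    rules: List[str],
    distractor_rules: Optional[List[str]] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Assemble '[fixed template prompt] [custom rules] [dialogue]'.

    `distractor_rules` stands in for the irrelevant rules that the caption
    says are mixed into training prompts; how they are sampled is an
    assumption of this sketch.
    """
    all_rules = list(rules)
    if distractor_rules:
        all_rules.extend(distractor_rules)
        # Shuffle so the position of a rule does not reveal its relevance.
        (rng or random).shuffle(all_rules)

    parts = [FIXED_TEMPLATE]
    if all_rules:
        numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(all_rules, 1))
        parts.append("Rules:\n" + numbered)
    parts.append(f"Dialogue:\nA: {query}\nB: {response}")
    return "\n\n".join(parts)


def parse_output(text: str) -> Tuple[str, str]:
    """Split '[answer] [analysis]' into (answer, analysis).

    Assumes the answer occupies the first line and the analysis follows.
    """
    answer, _, analysis = text.strip().partition("\n")
    return answer.strip(), analysis.strip()
```

At inference time only the caller's own rules would be passed, so `distractor_rules` can be left as `None`. Keeping the rules in a separate slot of the prompt is what lets one trained detector be steered toward different safety standards without retraining.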
