A Lightweight Explainable Guardrail for Prompt Safety

Md Asiful Islam; Mihai Surdeanu

TL;DR

LEG is a lightweight explainable guardrail for classifying unsafe prompts. On three datasets, it matches or outperforms the state of the art in both prompt classification and explainability, in-domain and out-of-domain, despite being considerably smaller than current approaches.

Abstract

We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state of the art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, even though its model is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.
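To make the loss description in the abstract more concrete, the following is a minimal sketch of how a cross-entropy term and a focal term might be combined with uncertainty-based task weighting. It is not the paper's implementation. The function names, the assignment of cross-entropy to the prompt task and focal loss to the word-level explanation task, the default `gamma`, and the weighting form `exp(-s) * L + s` (a common learned log-variance formulation) are all assumptions made for illustration. The "global explanation signals" term mentioned in the abstract is not modeled here.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability assigned to the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true)^gamma to down-weight easy examples."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)


def uncertainty_weighted(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i (assumed weighting form).

    Each s_i is a per-task log-variance; in a real model it would be a trainable parameter.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def joint_loss(prompt_p_true, token_p_true, log_vars=(0.0, 0.0), gamma=2.0):
    """Hypothetical joint loss for one prompt.

    prompt_p_true: probability the prompt classifier gives to the correct safe/unsafe label.
    token_p_true:  probabilities the explanation classifier gives to each word's correct label.
    """
    # Assumption: cross-entropy on the prompt-level task.
    prompt_loss = cross_entropy(prompt_p_true)
    # Assumption: focal loss, averaged over words, on the explanation task.
    token_loss = sum(focal_loss(p, gamma) for p in token_p_true) / len(token_p_true)
    return uncertainty_weighted([prompt_loss, token_loss], log_vars)


if __name__ == "__main__":
    # Toy probabilities, chosen only to exercise the functions.
    print(joint_loss(0.9, [0.8, 0.6]))
```

With both log-variances at zero the combination reduces to a plain sum of the two task losses; raising one task's log-variance shrinks that task's contribution while the additive `s` term penalizes ignoring it entirely.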

Paper Structure (51 sections, 12 equations, 2 figures, 11 tables, 1 algorithm)

Introduction
Related work
Alignment-based methods:
External guardrails:
Explainable guardrails:
Proposed method
Architecture
Synthetic data generation for explanations
Joint training
Joint loss function
Prompt classification loss:
Explainability classification loss:
Uncertainty-based task weighting:
Auxiliary weak supervision generation
Experiment setup
...and 36 more sections

Figures (2)

Figure 1: Overview of LEG. (a) The multi-task architecture jointly trains a prompt classifier and an explanation classifier on top of a shared transformer encoder. (b) Example of an unsafe input prompt and the structured output produced by LEG, which includes both the safety label and the corresponding explanation tokens.
Figure 2: Prompt for word label generation.
