ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi; Dongjin Kim; Seungbin Yang; Subin Kim; Youngjun Kwak; Juyoung Oh; Jaegul Choo; Jungmin Son

TL;DR

This paper introduces ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. The authors open-source the code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.

Abstract

With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts from these three domains, paired with corresponding refusal and compliant responses. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.

Paper Structure (61 sections, 14 figures, 23 tables)

Introduction
Related work
LLM Alignment & Content Moderation.
Safety Datasets & Benchmarks.
Constructing ExpGuardMix and ExpGuard
ExpGuardTrain: a multi-domain safety training dataset
Terminology mining
Prompt and response construction
Harmful domain-specific prompts.
Benign domain-specific prompts.
In-the-wild and human-written prompts.
Compliant and refusal responses.
Category labeling and data filtering
ExpGuardTest: an expert-annotated multi-domain safety benchmark
Training ExpGuard
...and 46 more sections

Figures (14)

Figure 1: (a) Illustration of a domain-specific adversarial attack, where ExpGuard successfully identifies and refuses a harmful domain-specific prompt that bypasses existing guardrails. (b) Overview of the ExpGuardMix composition, detailing the distribution of prompt/response types and their allocation across financial, medical, and legal domains.
Figure 1: Content safety risk taxonomy covered by ExpGuardMix.
Figure 2: Overview of the ExpGuardMix construction pipeline. The process consists of three main stages: (1) Domain-Specific Terminology Mining, involving term extraction from Wikipedia, followed by filtering using Wikidata, GPT-4o, and human verification; (2) Prompt and Response Construction, where domain-specific terms are used with GPT-4o to generate harmful and benign prompts, with corresponding harmful/benign responses and refusals generated by LLMs; and (3) Category Labeling and Data Filtering, which includes LLM-based classification of generated data into harm categories, majority voting, and deduplication to produce the final dataset.
Figure 3: Harmful domain-specific prompts and responses from ExpGuardMix (Financial, Medical, and Legal) that appear benign, with their harmful nature explained. Each example utilizes a technical term (in bold) to craft queries whose risks are apparent only with domain expertise.
Figure 4: Ablation study on ExpGuardTrain components used for training ExpGuard (%).
...and 9 more figures
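
The third stage described in the Figure 2 caption (LLM-based category labeling, majority voting, and deduplication) can be illustrated with a minimal sketch. The source does not specify how votes are aggregated or how duplicates are detected, so everything here is an assumption made for illustration: the function names, the `prompt` and `votes` field names, the strict-majority rule, and the use of exact matching on whitespace- and case-normalized text.

```python
from collections import Counter


def majority_label(votes):
    """Return the label backed by a strict majority of votes, or None if there is none.

    Assumption: a strict majority (> 50%) is required; the paper may use a different rule.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def normalize(text):
    """Lowercase and collapse whitespace so trivially different prompts compare equal."""
    return " ".join(text.lower().split())


def filter_examples(examples):
    """Keep examples with a majority label, dropping normalized-exact duplicates.

    `examples` is a list of dicts with hypothetical keys 'prompt' (str) and
    'votes' (list of category labels, e.g. one per LLM classifier).
    """
    seen = set()
    kept = []
    for ex in examples:
        label = majority_label(ex["votes"])
        if label is None:
            continue  # no agreement among labelers: discard
        key = normalize(ex["prompt"])
        if key in seen:
            continue  # duplicate of an earlier prompt: discard
        seen.add(key)
        kept.append({"prompt": ex["prompt"], "category": label})
    return kept


if __name__ == "__main__":
    sample = [
        {"prompt": "How do I report insider trading?", "votes": ["benign", "benign", "harmful"]},
        {"prompt": "how do I  report insider trading?", "votes": ["benign", "benign", "benign"]},
        {"prompt": "Explain a wash sale.", "votes": ["a", "b", "c"]},
    ]
    for row in filter_examples(sample):
        print(row)
```

In this sketch the second sample prompt is dropped as a duplicate of the first, and the third is dropped because no label reaches a majority. A production pipeline would likely use fuzzy or embedding-based deduplication rather than exact matching.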
