LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content

Jessica Foo; Shaun Khoo

TL;DR

LionGuard tackles the gap in moderation for low-resource, localized languages by building a Singapore-contextualized moderation classifier for Singlish. It combines a carefully designed safety taxonomy, large-scale automated labeling with multiple LLMs, and a lightweight embedding-plus-classifier architecture to outperform general-purpose moderation APIs on Singlish data. Key contributions include a 138k Singlish dataset, a systematic labeling pipeline with consensus LLMs, and a robust evaluation showing significant PR-AUC gains, especially in harassment and sexual/toxic categories. The work demonstrates the value of localization for moderation and provides a scalable blueprint for adapting safety tools to other low-resource languages, with open-source availability on Hugging Face.
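The consensus labeling idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes each LLM returns a binary vote per text (1 for unsafe, 0 for safe) and that a text is labeled unsafe when at least `threshold` LLMs agree. The function names, the example texts, and the votes are all hypothetical.

```python
from typing import Dict, List


def consensus_label(votes: List[int], threshold: int) -> int:
    """Return 1 (unsafe) if at least `threshold` labelers voted unsafe, else 0.

    Assumption: votes are binary (1 = unsafe, 0 = safe), one per LLM.
    """
    if not 1 <= threshold <= len(votes):
        raise ValueError("threshold must be between 1 and the number of votes")
    return int(sum(votes) >= threshold)


def label_dataset(votes_by_text: Dict[str, List[int]], threshold: int) -> Dict[str, int]:
    """Apply the same consensus threshold to every text in the dataset."""
    return {text: consensus_label(votes, threshold) for text, votes in votes_by_text.items()}


# Hypothetical votes from three LLM labelers (illustrative only, not real data).
example_votes = {
    "text_a": [1, 1, 1],
    "text_b": [1, 1, 0],
    "text_c": [0, 0, 0],
}

majority = label_dataset(example_votes, threshold=2)   # at least 2 of 3 agree
unanimous = label_dataset(example_votes, threshold=3)  # all 3 must agree
```

A stricter threshold marks fewer texts as unsafe: here `text_b` is unsafe under majority voting but safe under unanimity, which is the kind of trade-off a threshold comparison would need to weigh.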

Abstract

As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have become more significant. Most efforts at safety-tuning or moderation today take a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as a guardrail against unsafe LLM outputs. When assessed on Singlish data, LionGuard outperforms existing widely used moderation APIs, which are not fine-tuned for the Singapore context, by 14% (binary) and up to 51% (multi-label). Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages.

Paper Structure (36 sections, 10 figures, 8 tables)

Introduction
Singlish, an English Creole
Related Work
Content moderation
Low-resource language adaptation for moderation
Automated labelling
Methodology
Data Collection
Safety Risk Taxonomy
Automated Labelling
Engineering the labelling prompt
LLM Selection
Determining the Threshold for Safety
Compiling the dataset
Moderation Classifier
...and 21 more sections

Figures (10)

Figure 1: Overview of the 4-step methodology in building LionGuard
Figure 2: F1 scores and agreement across the 4 candidate LLMs for the prompt ablation comparison
Figure 3: F1 scores for each combination of prompt and candidate LLM
Figure 4: Comparing F1 scores and agreement for different threshold levels
Figure 5: Instructions Page. Page 1 of top section shows generic task title descriptions. Bottom section is a scrollable section that shows detailed task description and trigger warning.
...and 5 more figures
