LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content

Jessica Foo, Shaun Khoo

TL;DR

LionGuard addresses the moderation gap for low-resource, localized languages by building a Singapore-contextualized moderation classifier for Singlish. It combines a carefully designed safety taxonomy, large-scale automated labeling based on consensus among multiple LLMs, and a lightweight embedding-plus-classifier architecture, and it outperforms general-purpose moderation APIs on Singlish data. Key contributions include a dataset of 138k Singlish texts, a systematic LLM-consensus labeling pipeline, and an evaluation showing significant PR-AUC gains, especially in the harassment and sexual/toxic categories. The work demonstrates the value of localization for moderation and provides a scalable blueprint for adapting safety tools to other low-resource languages. The model is openly available on Hugging Face.
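The consensus-labeling idea mentioned above can be illustrated with a minimal sketch. It assumes each LLM labeler returns a binary safe/unsafe vote per text and that a label is kept only when a chosen fraction of labelers agree. The function name, the vote format, and the threshold values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def consensus_label(votes, threshold=1.0):
    """Return the agreed label (1 = unsafe, 0 = safe), or None if no consensus.

    votes: list of 0/1 labels, one per LLM labeler (hypothetical format).
    threshold: minimum fraction of labelers that must agree; it must be
        above 0.5 so that at most one label can ever qualify (assumption).
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1.0]")
    if not votes:
        return None
    # Most common label and how many labelers chose it.
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None


# Four hypothetical labelers voting on three texts.
print(consensus_label([1, 1, 1, 1]))        # unanimous vote
print(consensus_label([1, 1, 1, 0], 0.75))  # three of four agree
print(consensus_label([1, 1, 0, 0], 0.75))  # split vote, no label assigned
```

Texts that return None would be left unlabeled under this scheme. A stricter threshold trades dataset size for label reliability.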

Abstract

As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have become more significant. Most efforts at safety-tuning or moderation today take on a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as guardrails against unsafe LLM outputs. When assessed on Singlish data, LionGuard outperforms existing widely used moderation APIs, which are not fine-tuned for the Singapore context, by 14% (binary) and up to 51% (multi-label). Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages.
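The TL;DR reports the comparison in terms of PR-AUC. As a rough illustration of that metric, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. The labels and scores are made up for the example, and the paper's exact evaluation procedure may differ.

```python
def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Ranks examples by descending score and averages the precision observed
    at each positive example. Tied scores are not grouped in this sketch.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Made-up labels (1 = unsafe) and classifier scores for four texts.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(y_true, scores), 4))  # 0.8333
```

Unlike accuracy, this metric is driven by how well the positive (unsafe) class is ranked, which is the usual reason for preferring it when unsafe texts are a small minority of the data.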

Paper Structure (36 sections, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Overview of the 4-step methodology in building LionGuard
  • Figure 2: F1 scores and agreement across the 4 candidate LLMs for the prompt ablation comparison
  • Figure 3: F1 scores for each combination of prompt and candidate LLM
  • Figure 4: Comparing F1 scores and agreement for different threshold levels
  • Figure 5: Instructions Page. Page 1 of top section shows generic task title descriptions. Bottom section is a scrollable section that shows detailed task description and trigger warning.
  • ...and 5 more figures