PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, Maarten Sap

TL;DR

PolyGuard addresses multilingual safety moderation across 17 languages by pairing a large multilingual training dataset (PolyGuardMix) with a multilingual evaluation benchmark (PolyGuardPrompts). It trains multi-task safety detectors by LoRA-finetuning Qwen2.5 and Ministral models, achieving state-of-the-art performance against both open-weight and proprietary baselines. The study shows that combining in-the-wild (ITW) data with translated data yields robust performance on in-distribution and out-of-distribution benchmarks, including code-switching scenarios, and it releases open-source models of varying sizes for practical deployment. It also examines data quality and translation effects, showing that machine-translation (MT) artifacts are not the sole driver of performance and that dataset design significantly shapes multilingual safety moderation for LLMs.

Abstract

Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definition, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, and the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date, containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high-quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.

Paper Structure

This paper contains 31 sections, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: PolyGuard takes in a user prompt and, optionally, a model response, and lists the safety labels, violations, and model compliance, following the same safety taxonomy as Llama-Guard-3 (Dubey et al., 2024). Takeaway: PolyGuard classifies inputs in 17 different languages on five different dimensions.
  • Figure 2: Data curation process for PolyGuardMix (PGMix; safety detection training) and PolyGuardPrompts (PGPrompts; safety guardrail evaluation). Takeaway: PGMix combines machine-translated and naturally occurring data to improve data diversity and, consequently, model performance.
  • Figure 3: Safety category distribution for user prompts and model responses in WildGuardMix train samples. Each model name (GPT-4o, Llama-Guard-3-8B) identifies the LLM used as a judge to automatically annotate the safety category. These annotations are then ensembled, with Llama3.1-405B-Instruct breaking ties (Combined). Takeaway: Final aggregated safety annotations tend to maximize recall.
  • Figure 4: Safety category distributions for PGMix in-the-wild (ITW) samples.
  • Figure 5: Performance difference when ITW data is removed. Takeaway: Removing ITW data generally degrades model performance by reducing training data diversity.
  • ...and 4 more figures
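
The ensembling step described in the Figure 3 caption can be illustrated with a minimal sketch. The caption states only that the GPT-4o and Llama-Guard-3-8B annotations are combined and that Llama3.1-405B-Instruct breaks ties. Everything else here is an assumption made for illustration: the function name `combine_annotations`, the callback interface, the placeholder category strings, and the reading of "tie" as "the two judges disagree". It is not the authors' implementation.

```python
from typing import Callable

# Hypothetical interface: a tie-breaker receives both candidate labels and
# returns the one it prefers. In the paper this role is played by
# Llama3.1-405B-Instruct; here it is a plain Python callable.
TieBreaker = Callable[[str, str], str]


def combine_annotations(judge_a: str, judge_b: str, tie_breaker: TieBreaker) -> str:
    """Return the agreed category, or defer to the tie-breaker on disagreement."""
    if judge_a == judge_b:
        return judge_a
    choice = tie_breaker(judge_a, judge_b)
    # Assumption: the tie-breaker must pick one of the two candidate labels.
    if choice not in (judge_a, judge_b):
        raise ValueError(f"tie-breaker returned unexpected label: {choice!r}")
    return choice


if __name__ == "__main__":
    # Stand-in tie-breaker that always prefers the first judge's label.
    def prefer_first(a: str, b: str) -> str:
        return a

    # "S1" and "S10" are placeholder category strings for this demo.
    print(combine_annotations("S1", "S1", prefer_first))
    print(combine_annotations("S1", "S10", prefer_first))
```

Passing the tie-breaker as a callable keeps the agreement logic testable without calling any model; a real pipeline would wrap an LLM request behind the same signature.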