SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures

Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat

TL;DR

SEA-SafeguardBench introduces a native, culturally nuanced safety benchmark for Southeast Asia, covering eight languages and 1,338 cultural topics across three subsets (General, In-the-Wild, Content Generation). It employs three native-authoring pipelines with human verification to capture local norms, taboos, and region-specific harm scenarios, and evaluates a wide range of safeguards and LLMs with AUPRC as the primary metric. The results reveal substantial cross-language safety gaps, with SEA languages underperforming English, especially on culturally nuanced tasks, although culture-aware prompting and SEA pretraining show gains for some models. The work provides detailed error analyses and practical guidance for building culturally inclusive safety systems and motivates further SEA-language safety research and expansion to additional languages and contexts.
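Since AUPRC (area under the precision-recall curve) is named as the primary metric, a small worked example may help readers unfamiliar with it. The sketch below computes the common step-wise estimate (average precision) for a binary harmful/safe classifier. It is an illustration only: the function name `average_precision`, the toy labels, and the toy scores are assumptions made here, and the summary does not state which AUPRC estimator the authors actually used.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 = harmful (positive class), 0 = safe.
    scores: hypothetical safeguard harmfulness scores; higher = more harmful.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive (harmful) example")

    # Sort by descending score: each prefix is the set flagged at some threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    i = 0
    while i < len(order):
        # Treat all examples tied at the current score as a single threshold.
        threshold = scores[order[i]]
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        # Each threshold contributes (gain in recall) * (precision at that point).
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy data (invented for illustration, not taken from the benchmark).
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))
```

Unlike accuracy or F1 at a fixed cutoff, this quantity summarizes performance over all decision thresholds, which is one plausible reason to prefer it when comparing safeguards whose score scales differ.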

Abstract

Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region's linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages, 21,640 samples, across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and underperform when compared to English texts.

Paper Structure

This paper contains 49 sections, 22 figures, 21 tables.

Figures (22)

  • Figure 1: Samples from the three subsets of our benchmark and how we create them. The subsets are: (i) common safety topics around the world, (ii) an in-the-wild dataset, and (iii) content generation in Southeast Asia.
  • Figure 2: Data statistics of SEA-SafeguardBench. Please refer to the label-distribution appendix for the full distribution.
  • Figure 3: Visualization of general and cultural sets. To remove language bias, all samples were written in English; each point represents a cultural sample from a country, not a language.
  • Figure 4: Confusion matrices for four types of prompt-response pairs, evaluated with (A) and without (B) prompt access during response classification. In both settings, the prompt can be accessed during prompt classification.
  • Figure 5: Safeguard performance on prompt classification (top) and response classification (bottom) across different threshold values.
  • ...and 17 more figures