Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Measure Multilingual Safety Gaps

Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee

TL;DR

RabakBench presents a localized multilingual safety benchmark and scalable Generate-Label-Translate pipeline tailored to Singapore's Singlish, Chinese, Malay, and Tamil. It combines adversarial data generation, weak supervision with high-agreement LLM annotators, and toxicity-preserving translation to produce a 5,364-entry parallel safety corpus across four languages, with 76.6% labeled unsafe. The evaluation of 13 guardrails reveals pronounced safety gaps and language-dependent degradation, especially for Tamil, underscoring the need for native-context training and context-aware evaluation. The work offers a reproducible framework for building and extending localized safety benchmarks and provides open-source data to advance multilingual AI safety research.

Abstract

Large language models (LLMs) often fail to maintain safety in low-resource language varieties, such as code-mixed vernaculars and regional dialects. We introduce RabakBench, a multilingual safety benchmark and scalable pipeline localized to Singapore's unique linguistic landscape, covering Singlish, Chinese, Malay, and Tamil. We construct the benchmark through a three-stage pipeline: (1) Generate: augmenting real-world unsafe web content via LLM-driven red teaming; (2) Label: applying semi-automated multi-label annotation using majority-voted LLM labelers; and (3) Translate: performing high-fidelity, toxicity-preserving translation. The resulting dataset contains over 5,000 examples across six fine-grained safety categories. Despite using LLMs for scalability, our framework maintains rigorous human oversight, achieving 0.70-0.80 inter-annotator agreement. Evaluations of 13 state-of-the-art guardrails reveal significant performance degradation, underscoring the need for localized evaluation. RabakBench provides a reproducible framework for building safety benchmarks in underserved communities.
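As a rough illustration of the majority-voted, multi-label step in the Label stage, the sketch below aggregates per-annotator category sets and keeps any category flagged by a strict majority. The function name, the category names, and the strict-majority rule are assumptions made for illustration; the abstract does not specify the exact aggregation rule or taxonomy labels.

```python
from collections import Counter

# Hypothetical category names for illustration only; the abstract says there
# are six fine-grained categories but does not list them here.
CATEGORIES = ["hateful", "insults", "sexual", "violence", "self_harm", "misconduct"]


def majority_vote(labels_per_annotator, categories):
    """Return the set of categories flagged by a strict majority of annotators.

    labels_per_annotator: list of sets, one set of category names per annotator.
    categories: iterable of all valid category names.
    """
    n = len(labels_per_annotator)
    # Count how many annotators flagged each category.
    counts = Counter(cat for labels in labels_per_annotator for cat in set(labels))
    # Keep a category only if more than half of the annotators flagged it.
    return {cat for cat in categories if counts[cat] * 2 > n}


# Three hypothetical LLM annotators labeling one text.
votes = [{"hateful", "insults"}, {"insults"}, {"insults", "sexual"}]
final_labels = majority_vote(votes, CATEGORIES)
is_unsafe = bool(final_labels)  # unsafe if any category survives the vote
```

Voting per category, rather than on a single safe/unsafe flag, keeps the result multi-label: one text can end up with zero, one, or several categories.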

Paper Structure

This paper contains 44 sections, 2 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: Example of unsafe Singlish text in RabakBench
  • Figure 2: Summary of our dataset construction pipeline
  • Figure 3: Overview of automated guardrail red-teaming. We employ both GPT-4o (OpenAI, 2024) and DeepSeek-R1 (DeepSeek-AI, 2025) to generate prompts designed to stress-test the guardrail's classification boundaries. This is Stage 1b in Figure 2.
  • Figure 4: Source distribution: number of samples collected from each source.
  • Figure 5: Results from Alt-Test (Calderon et al., 2025) across different multi-label classification metrics. Gemini 2.0 Flash, o3-mini-low, and Claude 3.5 Haiku align best with our human annotators.
  • ...and 9 more figures