Benchmarking LLM Guardrails in Handling Multilingual Toxicity

Yahan Yang, Soham Dan, Dan Roth, Insup Lee

TL;DR

This work tackles the challenge of enforcing safe behavior in LLMs across multiple languages by building a comprehensive multilingual guardrail evaluation framework. It introduces a seven-dataset test suite with translations into more than ten languages and benchmarks state-of-the-art open-source guardrails, revealing consistent underperformance on non-English inputs and vulnerabilities to multilingual jailbreaking, including code-switching. The study shows that in-context safety policies and language resource availability significantly affect guardrail effectiveness, with customization offering notable gains. The findings highlight critical gaps in current guardrails for multilingual toxicity and provide guidance for more robust, language-inclusive safety mechanisms in practical deployments.

Abstract

With the ubiquity of Large Language Models (LLMs), guardrails have become crucial to detect and defend against toxic content. However, with the increasing pervasiveness of LLMs in multilingual scenarios, their effectiveness in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigate the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts. This work aims to identify the limitations of guardrails and to build more reliable and trustworthy LLMs in multilingual scenarios.

Paper Structure

This paper contains 14 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The guardrail gives inconsistent predictions for English and Thai inputs with the same semantic meaning. Here the guardrail model is LLaMa-Guard-3, which supports Thai.
  • Figure 2: F1 score of different models on Aegis dataset across different languages.
  • Figure 3: False Positive Rate of different models on Aegis dataset across different languages.
  • Figure 4: F1 score of different models on Multijail dataset across different languages.
  • Figure 5: Examples of different multilingual jailbreaking prompts we evaluated in our experiments. (Top row is for IT, and bottom row is for KO.)
  • ...and 6 more figures
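
The figures above report F1 score and False Positive Rate per language. As a rough illustration of how such per-language numbers can be derived from binary guardrail predictions, here is a minimal sketch. The function name `per_language_metrics`, the record layout, and the toy data are assumptions made for this example; they are not taken from the paper's evaluation code or results.

```python
from collections import defaultdict


def per_language_metrics(records):
    """Compute F1 and false positive rate for each language.

    `records` is an iterable of (language, gold, pred) tuples, where
    gold and pred are 1 for "unsafe" and 0 for "safe". This layout is
    a hypothetical convention chosen for the sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for lang, gold, pred in records:
        c = counts[lang]
        if gold == 1 and pred == 1:
            c["tp"] += 1
        elif gold == 0 and pred == 1:
            c["fp"] += 1
        elif gold == 0 and pred == 0:
            c["tn"] += 1
        else:  # gold == 1 and pred == 0
            c["fn"] += 1

    metrics = {}
    for lang, c in counts.items():
        f1_den = 2 * c["tp"] + c["fp"] + c["fn"]
        fpr_den = c["fp"] + c["tn"]
        metrics[lang] = {
            # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when undefined
            "f1": 2 * c["tp"] / f1_den if f1_den else 0.0,
            # FPR = FP / (FP + TN); defined as 0.0 when there are no safe inputs
            "fpr": c["fp"] / fpr_den if fpr_den else 0.0,
        }
    return metrics


# Toy data (made up): the same four prompts judged in English and Thai.
toy_records = [
    ("en", 1, 1), ("en", 1, 1), ("en", 0, 0), ("en", 0, 0),
    ("th", 1, 0), ("th", 1, 1), ("th", 0, 1), ("th", 0, 0),
]
toy_metrics = per_language_metrics(toy_records)
```

Comparing `toy_metrics["en"]` with `toy_metrics["th"]` mirrors the kind of cross-language gap the figures plot: identical inputs, different guardrail decisions, and therefore different F1 and FPR values.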