PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap

TL;DR

PolygloToxicityPrompts introduces the first large-scale multilingual benchmark for neural toxic degeneration, assembling 425K naturally occurring prompts across 17 languages from over 100M web-text documents and scoring them with Perspective API. By benchmarking 62 LLMs under varying prompt languages, model sizes, and alignment methods, the study reveals higher toxicity in low-resource languages and with larger base models, while instruction- and preference-tuning reduce toxicity with minimal differences between tuning methods. The work also contrasts toxicity detectors with safety classifiers, shows a measurable link between prompt and continuation toxicity, and highlights the influence of data sources on elicited toxicity. Overall, the paper underscores critical gaps in multilingual safeguarding and provides a scalable resource and methodology for advancing multilingual toxicity mitigation in LLMs.

Abstract

Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web-text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we investigate research questions to study the impact of model size, prompt language, and instruction- and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have any significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.

Paper Structure (49 sections, 11 figures, 5 tables)

Figures (11)

  • Figure 1: GPT-3.5-Turbo's Average Toxicity score on existing toxicity evaluation datasets, showing that PTP uncovers more toxicity in LLMs.
  • Figure 2: Summary of PolygloToxicityPrompts.
  • Figure 3: Language-wise AT trends for multilingual models. Takeaway: High toxicity scores (relative to the AT levels shown in Figure \ref{fig:motivation_results} and Table \ref{tab:top_3_best_worst}) for all languages indicate the need for multilingual toxicity mitigation methods.
  • Figure 4: Influence of model size on AT for Pythia suite. Takeaway: Toxicity increases with model size within a model family for base LLMs.
  • Figure 5: Influence of model size on AT in aligned models. Takeaway: Future work is required for safety-aligned LLMs.
  • ...and 6 more figures