The Multilingual Divide and Its Impact on Global AI Safety

Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker

TL;DR

The paper tackles the global language gap in AI safety, arguing that current LLM progress is English-centric and inequitable across languages and cultures. It presents the Aya Initiative as a practical, scalable approach to expand language coverage through diverse data sources, multilingual evaluation, and collaborative governance. Key contributions include new multilingual datasets and evaluation suites (Global-MMLU, INCLUDE, Aya Red-teaming), techniques for multilingual safety (Safety Context Distillation), and evidence from Aya on data mix and model merging that improve safety and coverage. The work underscores policy imperatives: open multilingual datasets, transparent language coverage, cross-institutional collaboration, and improved compute access to ensure safe AI across all languages with diverse cultural contexts.

Abstract

Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the "language gap" in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to address these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.

Paper Structure

This paper contains 13 sections, 6 figures.

Figures (6)

  • Figure 1: Bridging the Multilingual Divide: We scrutinize the reasons for the language gap in AI, and review and recommend concrete steps to bridge it. We highlight that bridging the language gap must involve safety mitigation across languages, and that open challenges remain.
  • Figure 2: The language gap is clearly visible in the availability of textual datasets across two popular sources: HuggingFace and Wikipedia. Circles represent the number of HuggingFace datasets including text per size tag and mentioning a given language. Color indicates the number of Wikipedia pages in the same language, for the six most frequent languages and a diverse selection of lower-resource languages (source: ranathunga-de-silva-2022-languages).
  • Figure 3: ChatGPT requires a greater number of tokens to encode the same contents across language scripts that are less well resourced (FLORES datasets flores101, data from ahia_all_2023). The number in brackets indicates the count of languages encoded in each script.
  • Figure 4: Results from shen-etal-2024-language: Lower-resource languages have a higher rate of harmful and irrelevant generations by GPT-4 than higher-resource languages (zou2023universaltransferableadversarialattacks).
  • Figure 5: Of examples in MMLU requiring cultural or regionally-specific knowledge to answer correctly, the majority are geographically tied to North America and dominated by Western culture (from singh2025globalmmluunderstandingaddressing).
  • ...and 1 more figure