Invisible Languages of the LLM Universe

Saurabh Khanna, Xinxu Li

TL;DR

This paper addresses the paradox of massive multilingual data coexisting with widespread linguistic invisibility in AI systems by introducing a vitality–digitality framework that treats linguistic vitality and online presence as orthogonal dimensions. It defines a representation score $R = \text{Digitality}_{\text{normalized}} - \text{Vitality}_{\text{normalized}}$ and classifies all 7,613 documented languages into four categories, notably identifying Invisible Giants ($+$Vitality / $-$Digitality): languages that have millions of speakers yet lack digital representation. The authors argue that the observed disparities reflect digital epistemic injustice rooted in postcolonial power structures and the architectural choices of AI ecosystems, not mere data scarcity, and they connect these findings to historical practices in missionary linguistics, ISO standards, and platform design. They propose concrete decolonization strategies for AI development, including community-governed datasets, new evaluation metrics that emphasize non-dominant linguistic features, and policy interventions to expand digital infrastructure for underrepresented languages, with the aim of democratizing access to AI benefits.
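The representation score and the four-way classification can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the min-max normalization, the 0.5 cutoff, and the names `Language`, `min_max`, `classify`, and `score_languages` are all assumptions made here for illustration, since the summary does not state how the dimensions are normalized or where the category boundaries fall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    name: str
    vitality: float    # raw vitality measure (hypothetical units)
    digitality: float  # raw digitality measure (hypothetical units)


def min_max(values):
    """Rescale values to [0, 1]; an assumed normalization, not the paper's."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def classify(v_norm, d_norm, threshold=0.5):
    """Map normalized scores to a category; the 0.5 cutoff is an assumption."""
    high_v = v_norm >= threshold
    high_d = d_norm >= threshold
    if high_v and high_d:
        return "Stronghold"
    if high_d:
        return "Digital Echo"      # high digitality, low vitality
    if high_v:
        return "Invisible Giant"   # high vitality, low digitality
    return "Fading Voice"


def score_languages(languages, threshold=0.5):
    """Return {name: (R, category)} with R = digitality_norm - vitality_norm."""
    v = min_max([lang.vitality for lang in languages])
    d = min_max([lang.digitality for lang in languages])
    return {
        lang.name: (dn - vn, classify(vn, dn, threshold))
        for lang, vn, dn in zip(languages, v, d)
    }
```

Under these assumptions, a strongly negative $R$ marks a language whose online presence lags its real-world strength, which is the Invisible Giant pattern; a strongly positive $R$ marks a Digital Echo.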

Abstract

Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world's 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real-world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and, critically, Invisible Giants (27%, high vitality but near-zero digitality), languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.

Paper Structure

This paper contains 25 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Mapping 7,613 human languages on vitality (ground presence) and digitality (web presence).
  • Figure 2: Geolocating invisible languages
  • Figure 3: Strongholds
  • Figure 4: Digital Echoes
  • Figure 5: Fading Voices
  • ...and 1 more figure