Artificial intelligence is creating a new global linguistic hierarchy

Giulia Occhini; Kumiko Tanaka-Ishii; Anna Barford; Refael Tikochinski; Songbo Hu; Roi Reichart; Yijie Zhou; Hannah Claus; Ulla Petti; Ivan Vulić; Ramit Debnath; Anna Korhonen

TL;DR

This paper presents a global longitudinal analysis of social, economic and infrastructural conditions across languages to assess systemic inequalities in language AI. It also introduces the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic and infrastructural prerequisites for AI deployment across languages.

Abstract

Artificial intelligence (AI) has the potential to transform healthcare, education, governance and socioeconomic equity, but its benefits remain concentrated in a small number of languages (Bender, 2019; Blasi et al., 2022; Joshi et al., 2020; Ranathunga and de Silva, 2022; Young, 2015). Language AI, the set of technologies that underpin widely used conversational systems such as ChatGPT, could provide major benefits if available in people's native languages, yet most of the world's 7,000+ linguistic communities currently lack access and face persistent digital marginalization. Here we present a global longitudinal analysis of social, economic and infrastructural conditions across languages to assess systemic inequalities in language AI. We first analyze the existence of AI resources for 6,003 languages. We find that, despite community efforts to broaden the reach of language technologies (Bapna et al., 2022; Costa-Jussà et al., 2022), the dominance of a handful of languages is exacerbating disparities on an unprecedented scale, with divides widening exponentially rather than narrowing. Further, we contrast the longitudinal diffusion of AI with that of earlier information technologies, revealing a distinctive hype-driven pattern of spread. To translate our findings into practical insights and guide prioritization efforts, we introduce the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages. The index highlights communities where capacity exists but remains underutilized, and provides a framework for accelerating more equitable diffusion of language AI. Our work contributes to setting the baseline for a transition towards more sustainable and equitable language technologies.

Paper Structure (14 sections, 5 equations, 15 figures, 7 tables)

Data availability statement
Existing data
Data exclusion
Missing data
Indicator justification
Supplementary figures
Validation of Hugging Face data against ACL Anthology corpus
Residuals of Zipf's models
Geographical distribution of resources
Technological diffusion parameters
Data analysis
Surveys output
Respondent characteristics
Survey results

Figures (15)

Figure 1: Evolution of Language Model and Dataset Distribution on Hugging Face over time (2020-2024), based on Wayback Machine data (Source: Hugging Face). These log-log plots illustrate Zipf’s Law, showing the frequency of models (left) and datasets (right) per language rank over time. The dashed blue line indicates an ideal Zipfian distribution with exponent $\alpha = 1$ for the year 2024. The distribution of resources per language reached a Zipf distribution in record time. English (red dot) is a major outlier, with an order of magnitude more models than expected, indicating extreme dominance. Other high-resource languages like French, Spanish, and Chinese follow the Zipf trend more closely.
Figure 2: Global Distribution of Languages by Language Models and Speaker Population. We plot the relationship between number of speakers and number of online language models on a log-log scale. Languages are grouped into four categories: "Mid-tier" (blue), "Dead" (black), "Under-resourced" (orange), and "Top 5 Languages per Speaker Bin" (green). By “Top 5 Languages per Speaker Bin” we refer to the five languages with the most language models within each speaker population range. A purple OLS regression line indicates the expected trend ($\beta_1 = 0.312$, $p < 0.001$, $R^2 = 0.304$); languages that fall below this line despite having over 1 million speakers are classified as under-resourced. Each group is plotted on a world map showing its geographic distribution, with larger versions of these subplots available in the Supplementary Figures.
Figure 3: Linearized Gompertz transformation of adoption data for Language Models, Mobile Phones, Fixed Broadband, Electric Vehicles and Personal Computers. Growth for each technology progresses from the bottom-left (early adoption) towards the upper-right (approaching saturation). We normalise both adoption values (to maximum) and time (Z-score) to enable a cross-technology comparison.
Figure 4: Bar plots displaying the magnitude and direction of the component loadings for 24 variables on the first two principal components (PC1 and PC2) after Varimax rotation. PC1 (top panel) shows a strong positive contribution from socioeconomic and digital infrastructure variables, with the highest loading strengths observed for the proportion of individuals using the internet (0.35), the Human Development Index (HDI; 0.35), and education level (0.33). Network latency provides the sole significant negative loading ($-0.18$). PC2 (bottom panel) is primarily defined by the availability of AI resources, with the highest loading strengths coming from Wikipedia active users (0.35), the number of datasets (0.35), and CommonCrawl data volume (0.35). The clear separation of variable groups across the two components graphically confirms the distinct factors identified by the PCA.
Figure 5: Demonstration of our open-access AI language readiness tool. Subfigure a shows the difference in distribution between languages that have AI resources and languages that are socio-technically ready for AI. Many languages have only a fraction of the language modeling resources available for English, yet are ready in terms of socioeconomics and digital infrastructure. Subfigure b shows the top-ranking languages in our index by overall score, available at https://www.equate-index.ai/. Subfigure c shows the global explorer feature of our index. Blue circles indicate language clusters, with the number inside showing the count of languages in each region. The markers are interactive, allowing zoom and click to reveal individual languages and their AI readiness metrics. Filters on the left enable customization by overall ranking and minimum speaker count.
...and 10 more figures
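
As a rough illustration of the quantities named in the captions of Figures 1 and 3, the sketch below estimates a Zipf exponent from per-language resource counts and applies a linearized Gompertz transformation to an adoption series. It is not the authors' code. The function names, the log-log least-squares estimator, and the `saturation` and `ceiling_margin` parameters are assumptions made for this example. It assumes counts follow $f(r) = C / r^{\alpha}$ and adoption follows $y(t) = K \exp(-b\,e^{-ct})$, so that $-\ln(-\ln(y/K)) = ct - \ln b$ is linear in time.

```python
import math
from statistics import mean, pstdev


def ols_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def zipf_alpha(counts):
    """Estimate alpha in count ~ C / rank**alpha via a log-log OLS fit.

    Assumption: zero counts are dropped, since log(0) is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    log_rank = [math.log(r) for r in range(1, len(ranked) + 1)]
    log_count = [math.log(c) for c in ranked]
    slope, _ = ols_line(log_rank, log_count)
    return -slope


def gompertz_linearize(years, adoption, saturation=None, ceiling_margin=1.05):
    """Return (z-scored time, -ln(-ln(y / K))) pairs.

    Assumption: if no saturation level K is given, K is set slightly above
    the observed maximum so the transform stays finite at the peak.
    Observations with y <= 0 or y >= K are skipped.
    """
    k = saturation if saturation is not None else max(adoption) * ceiling_margin
    mu, sigma = mean(years), pstdev(years)
    points = []
    for t, y in zip(years, adoption):
        if y <= 0 or y >= k:
            continue
        points.append(((t - mu) / sigma, -math.log(-math.log(y / k))))
    return points
```

On data that follow these models exactly, `zipf_alpha` recovers the exponent and the transformed Gompertz points fall on a straight line whose slope is the growth rate $c$ multiplied by the standard deviation of the time axis. Real counts would scatter around that line, and an outlier at rank 1, as described for English in Figure 1, would pull a least-squares estimate.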
