Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez

TL;DR

This paper shows that LLMs perform poorly in low-resource languages, presenting a "double jeopardy" of higher costs and poor performance for these users, which underscores the need for fairer algorithm development to benefit all linguistic groups.

Abstract

Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI's GPT models via APIs because of how the system processes the input: tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a "double jeopardy" of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.

Paper Structure

This paper contains 21 sections, 7 equations, 3 figures.

Figures (3)

  • Figure 1: Summary of the double jeopardy in low-resource languages (such as Shan, Santhali, Dzongkha, Tamasheq, Kabiyè, and Nuer), which are mostly spoken in low- and lower-middle-income countries. The cost of using LLMs is higher for these languages when pricing is based on tokenization, and the performance of LLMs in these languages is also poor. Results are shown for the GPT-4 and GPT-4o tokenizers. The trendlines suggest that the GPT-4o tokenizer has generally reduced fragmentation. The derivation of the values used in this figure is detailed in the method section.
  • Figure 2: Visualization of tokenization for an equivalent sentence in English and Telugu (https://platform.openai.com/tokenizer). Note the number of tokens for each language after applying the tokenizer. Although the Telugu sentence has fewer characters than its English equivalent, the English sentence yields only 49 tokens while the Telugu sentence yields 360, a fragmentation rate of around seven times.
  • Figure 3: Ethnologue page showing the various locations where the Garifuna language, mainly spoken in Honduras, is also spoken. The population data are reported based on different sources and points in time.
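
The fragmentation rate quoted in the Figure 2 caption can be reproduced with simple arithmetic. The sketch below is illustrative only and is not the authors' code: the token counts (49 and 360) come from the caption, while the function names and the per-token price are hypothetical placeholders. It assumes strictly per-token pricing, in which case the cost ratio between two languages equals the ratio of their token counts regardless of the actual price.

```python
def fragmentation_rate(target_tokens: int, reference_tokens: int) -> float:
    """Token count of the target-language sentence relative to the reference."""
    if reference_tokens <= 0:
        raise ValueError("reference_tokens must be positive")
    return target_tokens / reference_tokens


def request_cost(num_tokens: int, price_per_token: float) -> float:
    """Cost of a request under purely per-token pricing (an assumption)."""
    return num_tokens * price_per_token


# Token counts reported in the Figure 2 caption.
ENGLISH_TOKENS = 49
TELUGU_TOKENS = 360

# Hypothetical price, used only to show that it cancels out of the ratio.
HYPOTHETICAL_PRICE_PER_TOKEN = 0.00001

rate = fragmentation_rate(TELUGU_TOKENS, ENGLISH_TOKENS)
cost_ratio = (
    request_cost(TELUGU_TOKENS, HYPOTHETICAL_PRICE_PER_TOKEN)
    / request_cost(ENGLISH_TOKENS, HYPOTHETICAL_PRICE_PER_TOKEN)
)

print(f"fragmentation rate: {rate:.2f}")
print(f"cost ratio:         {cost_ratio:.2f}")
```

Both values come out to roughly 7.3, matching the caption's "around seven times". In practice the token counts would come from running the same sentence through a tokenizer, for example the web tool linked in the caption.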