
Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability

Wei-Rui Chen, Ife Adebara, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed

TL;DR

The study probes ChatGPT's language identification across Babel-670, revealing substantial gaps in coverage, particularly for African languages. It introduces two prompt paradigms (language names and language codes) and a postprocessing/ADA evaluation framework to assess zero- and few-shot LID under varied label-set conditions. GPT-4 generally outperforms GPT-3.5, but many languages still yield zero F1, and performance is strongly shaped by script distinctiveness and geographic distribution. The work highlights the need for expanding language support in LLMs and provides a benchmarked, analysis-driven path toward more inclusive multilingual NLP tools.

Abstract

ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT 'knows', we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 24 language families spoken in five continents. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource. We then study the ability of ChatGPT (both GPT-3.5 and GPT-4) to (i) identify language names and language codes, (ii) under zero- and few-shot conditions, and (iii) with and without provision of a label set. When compared to smaller finetuned LID tools, we find that ChatGPT lags behind. For example, it has poor performance on African languages. We conclude that current large language models would benefit from further development before they can sufficiently serve diverse communities.

Paper Structure (25 sections, 5 figures, 20 tables)

Figures (5)

  • Figure 1: A choropleth map where the intensity indicates the averaged F1 score of languages spoken in each region. Language support varies geographically, with African languages, for example, being strikingly less supported. The figure is drawn based on the results of one of our experimental settings (Language Name Prompt [Alias-Dialect-accepting], GPT-4, hard, 0-shot; see Section \ref{sec:methodology} for more details). A larger map is available in Figure \ref{fig:language_world_map_large} in the Appendix.
  • Figure 2: An overview of the different experimental settings, with exemplified predictions and test examples in French (fra), Spanish (spa), and Southwestern Dinka (dik). The language name prompt (LNP) has both exact-match and alias-dialect-accepting evaluation, while the language code prompt (LCP) has only exact-match evaluation. The LNP prediction for the third test example (Northeastern Dinka) is considered incorrect in exact-match evaluation but correct in alias-dialect-accepting evaluation.
  • Figure 3: Languages in different ranges of $F_1$ score ($\%$). The 382 languages with zero $F_1$ score are not included in this figure but are reported in Appendix Table \ref{tab:lang_names_zero_f1_LNP_dialect_accepting}. The distribution is M-shaped and bimodal: the two extremes, zero $F_1$ score for $382$ languages and $>90\%$ $F_1$ score for $100$ languages, account for most languages ($\sim 500$ languages). Results are for the setting (LNP [alias-dialect-accepting], GPT-4, hard, 0-shot).
  • Figure A.1: A larger choropleth map where the intensity indicates the averaged F1 score of languages spoken in each region. It can be seen that language support varies geographically, with African languages being less supported. The figure is drawn based on the results of one of our experimental settings (Language Name Prompt [Alias-Dialect-accepting], GPT-4, hard, 0-shot).
  • Figure C.1: An overview of alias-dialect-accepting (ADA) evaluation for language name prompt (LNP).
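
The difference between exact-match and alias-dialect-accepting (ADA) evaluation described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `ACCEPTED` table, the function names, and the per-language F1 bookkeeping are assumptions made for illustration only, and the table entries are hypothetical stand-ins for whatever alias and dialect lists the paper actually uses.

```python
from collections import defaultdict

# Hypothetical alias/dialect table (illustrative only, not from the paper):
# maps a gold language name to the names accepted as correct under ADA.
ACCEPTED = {
    "French": {"French"},
    "Spanish": {"Spanish", "Castilian"},
    "Southwestern Dinka": {"Southwestern Dinka", "Northeastern Dinka", "Dinka"},
}


def is_correct(gold, pred, ada=False):
    """Exact match by default; with ada=True, any accepted alias or dialect counts."""
    if ada:
        return pred in ACCEPTED.get(gold, {gold})
    return pred == gold


def per_language_f1(golds, preds, ada=False):
    """Compute an F1 score for each gold language from paired gold/predicted names."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(golds, preds):
        if is_correct(gold, pred, ada):
            tp[gold] += 1
        else:
            fn[gold] += 1  # the gold language was missed
            fp[pred] += 1  # the predicted label was wrongly emitted
    scores = {}
    for lang in set(golds):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores[lang] = 2 * tp[lang] / denom if denom else 0.0
    return scores


# Mirrors the three test examples named in the Figure 2 caption.
golds = ["French", "Spanish", "Southwestern Dinka"]
preds = ["French", "Spanish", "Northeastern Dinka"]

exact_scores = per_language_f1(golds, preds, ada=False)
ada_scores = per_language_f1(golds, preds, ada=True)
print(exact_scores["Southwestern Dinka"], ada_scores["Southwestern Dinka"])
```

Under these assumptions, the Northeastern Dinka prediction counts as an error for Southwestern Dinka in exact-match evaluation but as a hit under ADA, which matches the behavior the caption describes.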