The Zeno's Paradox of ‘Low-Resource’ Languages

Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Benjamin Rosman, Thamar Solorio, Monojit Choudhury

TL;DR

This work qualitatively analyzes 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword ‘low-resource’. It shows how several interacting axes contribute to the ‘low-resourcedness’ of a language and why that makes it difficult to track progress for each individual language.

Abstract

The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low- vs. high-resourced. However, there is limited consensus on what exactly qualifies as a ‘low-resource language.’ To understand how NLP papers define and study ‘low-resource’ languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword ‘low-resource.’ Based on our analysis, we show how several interacting axes contribute to the ‘low-resourcedness’ of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when designating a language as low-resource.

Paper Structure

This paper contains 48 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Four overarching aspects that contribute to a language being classified as low-resource. Socio-political aspects are at the top, influencing both the availability of resources and the creation of artifacts. Community agency is a common thread in all the other three aspects.
  • Figure 2: Number of languages included in the studies per language family.
  • Figure 3: Distribution of the criteria used to categorize the top-20 languages.
  • Figure 4: Language profiles for six languages across three classes based on data availability. The first row in each profile deals with socio-political issues, the second row with resources, and the last row with artifacts (see Figure \ref{fig:strata}). We observe drastic differences between languages of the same class. See Appendix \ref{apn:labels} for details on the labels.
  • Figure 5: Distribution of criteria stated by papers in our study to categorize languages as low-resource.
  • ...and 2 more figures