GlotLID: Language Identification for Low-Resource Languages

Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze

TL;DR

The paper addresses the need for robust language identification (LID) across a wide range of low-resource languages. It introduces GlotLID-C, a large, curated dataset spanning 1832 languages, and GlotLID-M, an open-source FastText-based LID model trained on GlotLID-C that covers 1665 languages. GlotLID-M outperforms several baselines on the UDHR and FLORES benchmarks, particularly when balancing F1 against false positive rate (FPR) in realistic deployment scenarios that allow for unknown languages. The work also analyzes common LID challenges in low-resource settings, such as noise in web-crawled data, macrolanguages versus their varieties, and data contamination, and it emphasizes the importance of calibrated confidence thresholds for building high-quality corpora. The authors present GlotLID-M as a resource for creating high-quality multilingual corpora and advancing NLP for many low-resource languages, and they outline limitations and directions for future improvement and evaluation.
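To make the role of a confidence threshold concrete, the following is a minimal sketch of a thresholded decision rule that falls back to an "unknown" label. It is not the paper's exact rule: the function name `assign_language`, the `"und"` fallback label, the example language codes, and the default threshold of 0.5 are all assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical label for "no supported language is confident enough".
UNKNOWN = "und"


def assign_language(probs: Dict[str, float], threshold: float = 0.5) -> str:
    """Return the top-scoring label, or UNKNOWN if its probability is below threshold.

    `probs` maps language labels to model probabilities (assumed input format).
    """
    if not probs:
        return UNKNOWN
    # Pick the label with the highest probability.
    best_label, best_prob = max(probs.items(), key=lambda kv: kv[1])
    # Reject low-confidence predictions instead of forcing a language.
    return best_label if best_prob >= threshold else UNKNOWN
```

Raising the threshold trades recall for fewer false positives, which is the kind of F1 versus FPR balance the summary refers to.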

Abstract

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model (including future versions), code, and list of data sources are available: https://github.com/cisnlp/GlotLID.

Paper Structure (20 sections, 1 equation, 2 figures, 29 tables)

Figures (2)

  • Figure 1: Decision rule for assigning classes (i.e., languages) in language identification
  • Figure 2: Reliability diagram for GlotLID-M on GlotLID-C test
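A reliability diagram such as the one in Figure 2 is typically built by grouping predictions into confidence bins and comparing each bin's mean confidence with its empirical accuracy. The sketch below shows that general computation under stated assumptions: equal-width bins, a hypothetical helper named `reliability_bins`, and inputs given as parallel lists of confidences and correctness flags. It is not taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    Bins are equal-width over [0, 1]; a confidence of exactly 1.0 falls
    into the last bin.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    conf_sums = [0.0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        totals[idx] += 1
        hits[idx] += int(ok)
        conf_sums[idx] += conf
    rows = []
    for i in range(n_bins):
        if totals[i]:
            rows.append((conf_sums[i] / totals[i], hits[i] / totals[i], totals[i]))
    return rows
```

For a well-calibrated model, the accuracy in each row stays close to the mean confidence; large gaps indicate over- or under-confidence in that range.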