From N-grams to Pre-trained Multilingual Models For Language Identification

Thapelo Sindane, Vukosi Marivate

TL;DR

The paper tackles language identification (LID) for 11 South African languages by comparing traditional N-gram and Naive Bayes (NB) methods against large pre-trained multilingual language models (PLMs), including Afri-centric variants. It demonstrates that while NB with word-level features is the strongest of the baselines, Serengeti and other Afri-centric PLMs achieve the highest accuracy (around $98\%$ on average), with lightweight models offering competitive performance at reduced compute cost. The study also reveals cross-domain generalization patterns, highlights the importance of precision-focused evaluation for closely related languages, and releases models and code to facilitate practical LID deployment in low-resource settings. Overall, the work guides practitioners toward efficient, high-performance LID strategies tailored to South African languages and points to directions for future research, such as embeddings and smaller, resource-efficient models.

Abstract

In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that selecting an effective data size remains crucial for establishing frequency distributions that model each target language well, which in turn improves language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual models (PLMs): mBERT, RemBERT, and XLM-r, together with the Afri-centric multilingual models AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale LID tools, namely Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID, to highlight the importance of focused LID. From these experiments, we show that Serengeti is, on average, the strongest model across all model families, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid) trained on the NCHLT + Vuk'uzenzele corpus, which performs on par with our best-performing Afri-centric models.
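
The abstract does not spell out the N-gram pipeline, so the following is only a minimal sketch of one common way to turn per-language frequency distributions into a language ranking: character n-gram rank profiles compared with an out-of-place distance. The function names, the `top_k` cutoff, the n-gram range, and the three toy sentences are all assumptions made for illustration; none of them come from the paper or its corpora.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def build_profile(texts, top_k=300):
    """Map each of the top_k most frequent n-grams to its frequency rank."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams unseen in the language get max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles, top_k=300):
    """Return the language with the smallest distance, plus all distances."""
    doc_profile = build_profile([text], top_k)
    scores = {
        lang: out_of_place(doc_profile, profile, top_k)
        for lang, profile in profiles.items()
    }
    return min(scores, key=scores.get), scores


# Toy data (hypothetical, one short sentence per language, not from the paper).
TOY_CORPUS = {
    "afr": ["Goeie more, hoe gaan dit met jou?"],
    "zul": ["Sawubona, unjani namhlanje?"],
    "eng": ["Good morning, how are you today?"],
}
PROFILES = {lang: build_profile(texts) for lang, texts in TOY_CORPUS.items()}

predicted, distances = identify("hoe gaan dit vandag", PROFILES)
print(predicted, distances)
```

With realistic training data the profiles would be built from many sentences per language, which is where the data-size effect described in the abstract would show up: too little text gives unstable ranks, and the distance-based ranking degrades accordingly.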

Paper Structure

This paper contains 16 sections, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Sentence length distribution of Vuk corpora. The x-axis denotes the number of tokens (words) in the sentences.
  • Figure 2: Sentence length distribution of NCHLT + Vuk corpora. The x-axis denotes the number of tokens (words) in the sentences.
  • Figure 3: Data size variation performance on Vuk test data.
  • Figure 4: Score heatmap for all predictions using the N-gram model.
  • Figure 5: Score heatmap for correctly predicted examples using the N-gram model.
  • ...and 16 more figures