NaijaNLP: A Survey of Nigerian Low-Resource Languages

Isa Inuwa-Dutse

TL;DR

This paper addresses the scarcity of NLP resources for Nigeria's three major languages (Hausa, Igbo, Yorùbá) through a structured review of LR-NLP literature, resources, and tools. It finds that only about 25.1% of the reviewed studies introduce new linguistic resources, which points to a reliance on repurposed data, and that diacritic and tonal representations remain under-explored. The survey synthesizes formal grammar, language particularities, datasets, and downstream tasks, and proposes a dedicated resource hub along with strategies that include resource enrichment, multilingual transfer, and shared tasks. The work underscores the need for language-specific data collection, open collaboration, and indigenous modelling to advance NaijaNLP, with implications for low-resource NLP more broadly.

Abstract

Of the more than 500 languages in Nigeria, three (Hausa, Yorùbá and Igbo), spoken by over 175 million people, account for about 60% of the spoken languages. However, these languages are categorised as low-resource because the resources available to support tasks in computational linguistics are insufficient. Several research efforts and initiatives have been presented; however, a coherent understanding of the state of Natural Language Processing (NLP), from grammatical formalisation to linguistic resources that support complex tasks such as language understanding and generation, is lacking. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages (NaijaNLP). We quantitatively assess the available linguistic resources and identify key challenges. Although a growing body of literature addresses various NLP downstream tasks in Hausa, Igbo, and Yorùbá, only about 25.1% of the reviewed studies contribute new linguistic resources. This finding highlights a persistent reliance on repurposing existing data rather than generating novel, high-quality resources. Additionally, language-specific challenges, such as the accurate representation of diacritics, remain under-explored. To advance NaijaNLP and LR-NLP more broadly, we emphasise the need for intensified efforts in resource enrichment, comprehensive annotation, and the development of open collaborative initiatives.

Paper Structure

This paper contains 57 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Overview of the review process, highlighting (1) guiding questions and search keywords, (2) search databases and corresponding retrieved results, and (3) the screening process. Variations in the final article counts are due to overlapping entries retrieved from Google's search engine and other academic databases.
  • Figure 2: Frequency of research output and publication venues for NaijaNLP.