Table of Contents
Fetching ...

The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

Sheriff Issaka, Keyi Wang, Yinka Ajibola, Oluwatumininu Samuel-Ipaye, Zhaoyi Zhang, Nicte Aguillon Jimenez, Evans Kofi Agyei, Abraham Lin, Rohan Ramachandran, Sadick Abdul Mumin, Faith Nchifor, Mohammed Shuraim, Lieqi Liu, Erick Rosas Gonzalez, Sylvester Kpei, Jemimah Osei, Carlene Ajeneza, Persis Boateng, Prisca Adwoa Dufie Yeboah, Saadia Gabriel

TL;DR

The African Languages Lab (All Lab) addresses the severe underrepresentation of African languages in NLP by building systematic data infrastructure, a large-scale multimodal dataset, and a capacity-building program. It introduces All Voices, a mobile-first platform for direct translation between 40 African languages, yielding 19 billion tokens of monolingual text and 12,628 hours of aligned speech, which are used to fine-tune Llama-3.2-1B with full supervision, achieving substantial average gains across metrics (+23.69 ChrF++, +0.33 COMET, +15.34 BLEU) on 31 languages and competitive performance against Google Translate in several cases. The contributions—the validated dataset, the data-collection platform, and the mentorship program—demonstrate that targeted data collection and local capacity development can meaningfully reduce linguistic marginalization in NLP and enable more equitable access to language technology. This work highlights the practical impact of coordinated community engagement, robust infrastructure, and responsible governance for sustaining NLP advances for Africa’s diverse linguistic landscape.

Abstract

Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88\% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.

The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

TL;DR

The African Languages Lab (All Lab) addresses the severe underrepresentation of African languages in NLP by building systematic data infrastructure, a large-scale multimodal dataset, and a capacity-building program. It introduces All Voices, a mobile-first platform for direct translation between 40 African languages, yielding 19 billion tokens of monolingual text and 12,628 hours of aligned speech, which are used to fine-tune Llama-3.2-1B with full supervision, achieving substantial average gains across metrics (+23.69 ChrF++, +0.33 COMET, +15.34 BLEU) on 31 languages and competitive performance against Google Translate in several cases. The contributions—the validated dataset, the data-collection platform, and the mentorship program—demonstrate that targeted data collection and local capacity development can meaningfully reduce linguistic marginalization in NLP and enable more equitable access to language technology. This work highlights the practical impact of coordinated community engagement, robust infrastructure, and responsible governance for sustaining NLP advances for Africa’s diverse linguistic landscape.

Abstract

Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88\% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.

Paper Structure

This paper contains 24 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Number of endangered languages in each African country, where darker shading indicates a higher number of endangered languages.
  • Figure 2: The All Voices platform interface demonstrating its dual functionality: direct text translation from English to Yoruba (left panel) and community-driven translation validation system (right panel). The mobile-first design enables participation from users with limited technical resources.
  • Figure 3: End-to-end processing pipeline showing multi-source integration, language-specific preprocessing, and statistical validation ensuring dataset quality.
  • Figure 4: Afrikaans Translation
  • Figure 5: Amharic Translation
  • ...and 8 more figures