Dealing with the Hard Facts of Low-Resource African NLP

Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamaté, Madani Amadou Tall, Emmanuel Élisé Koné, Aymane Dembélé, Michael Leventhal

TL;DR

The paper tackles the scarcity of open, deployable resources for ASR in low-resource African languages by field-collecting 612 hours of spontaneous Bambara, applying a semi-automated transcription pipeline, and releasing a large open dataset with multiple evaluation sets. It evaluates monolingual ultra-compact and small models (including Parakeet-based and QuartzNet architectures) on in-domain and heterogeneous benchmarks, combining automatic metrics with human judgments to reveal gaps between string-based scores and native-speaker assessments. The study reports substantial WER/CER improvements through finetuning and highlights two practical benefits: a human-in-the-loop transcription workflow that reduces annotation time, and compact models that can be deployed on modest hardware. Overall, it provides a concrete, transferable workflow for low-resource NLP that can extend to other Manding languages and similar contexts, advancing accessible speech technology for millions of speakers.

Abstract

Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
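
For readers unfamiliar with the string-based scores mentioned in the TL;DR, the sketch below shows how word error rate (WER) and character error rate (CER) are conventionally computed: Levenshtein edit distance between reference and hypothesis, divided by the reference length. This is a minimal illustration, not the paper's evaluation code. The function names and the English placeholder sentences are assumptions made for the example, and no text normalization (casing, punctuation, tone marks) is applied, although real evaluations usually normalize first.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    # prev[j] holds the distance between the first i-1 ref tokens and
    # the first j hyp tokens; we keep only one previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i ref tokens into an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Placeholder English sentences, used only to exercise the functions.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because both metrics count only surface string edits, a transcript with an acceptable spelling variant is penalized the same as one with a genuine recognition error, which is one way string-based scores and native-speaker judgments can diverge.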

Paper Structure

This paper contains 16 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Statistical overview of the African Next Voices Bambara dataset: age, gender, region, and topic distributions. The first three charts are computed over the number of speakers, while the topic distribution is expressed in duration. The locations labeled 'others' refer to rural areas/villages around the 5 other main regions.
  • Figure 2: Density Distribution of Signal-to-Noise Ratio values in the African Next Voices Bambara Dataset. Note that the SNR values are not bounded.
  • Figure 3: WER vs human evaluation. Figure from tall_2025_17672774
  • Figure 4: The Labeling Interface for the African Next Voices Bambara Transcription Project. The interface shows the original audio waveform, the automatically generated pre-transcription, and the field for human correction/validation.