How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

Benjamin Akera, Evelyn Nafula, Patrick Walukagga, Gilbert Yiga, John Quinn, Ernest Mwebaze

TL;DR

This work tackles practical data requirements for ASR in low-resource African languages by evaluating Whisper large-v3 on Kinyarwanda (data-scaled from 1 to 1,400 hours) and Kikuyu (270 hours) to identify data-volume thresholds and primary failure modes. The results show that practical WER below $13\%$ is achievable with about 50 hours of data, dropping to around $9.82\%$ at 200 hours and $7.14\%$ with the full data for Kinyarwanda, while Kikuyu reveals a bimodal error distribution with a median WER of $26.3\%$ and a mean of $30.3\%$, including a notable tail of higher errors largely due to noisy ground truth ($38.6\%$ of high-error cases). The study underscores that data quality is as crucial as data volume, with a large share of errors stemming from annotation noise rather than model limitations, and it offers concrete deployment benchmarks and guidance for similar low-resource contexts. All code, models, and datasets are publicly released to support replication and practical adoption in related languages.

Abstract

The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release the accompanying code and models; see https://github.com/SunbirdAI/kinyarwanda-whisper-eval
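
The thresholds quoted above are stated in terms of Word Error Rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration. This is not the paper's evaluation code, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch only: tokenizes on whitespace and applies no text
    normalization (casing, punctuation), which real evaluations often do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("a b c d", "a x c d")` is one substitution over four reference words. Because insertions count as errors, a single utterance can score above 100%, which is one way a long tail of high-error cases can arise.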

Paper Structure

This paper contains 10 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Word Error Rate and Character Error Rate by training data volume for Kinyarwanda. Both metrics show consistent improvement with increased data, with the most substantial gains in the first 200 hours.
  • Figure 2: Validation Word Error Rate evolution during training for different Kinyarwanda dataset sizes. Training time is shown in seconds on a single H100 GPU, with the longest run requiring 20.2 hours.
  • Figure 3: Distribution of Word Error Rates across 6,910 Kikuyu evaluation samples. The histogram shows high concentration in the 0-50% range with a long tail of higher-error cases.
  • Figure 4: Word Error Rate distribution by performance category for Kikuyu evaluation data. Box plots show median, quartiles, and outliers for each quality band.
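
The distribution described in Figures 3 and 4 can be summarized with a few statistics over per-utterance WER values. The sketch below is an illustration under stated assumptions: the helper `summarize_wer` and the 0.5 cutoff for "high-error" are hypothetical choices made here, not the paper's definitions. It shows why a long tail of high-error samples pulls the mean above the median.

```python
from statistics import mean, median


def summarize_wer(wers, high_error_threshold=0.5):
    """Summarize per-utterance WER values given as fractions (0.25 = 25%).

    `high_error_threshold` is a hypothetical cutoff chosen for illustration.
    """
    if not wers:
        raise ValueError("need at least one WER value")
    high = [w for w in wers if w > high_error_threshold]
    return {
        "mean": mean(wers),
        "median": median(wers),
        "high_error_share": len(high) / len(wers),
    }


# Synthetic toy values (not data from the paper): one outlier in the tail.
toy = [0.0, 0.1, 0.2, 0.3, 1.4]
summary = summarize_wer(toy)
print(summary)
```

In the toy list a single outlier at 1.4 lifts the mean to roughly twice the median. The same mechanism is consistent with the reported gap between the Kikuyu median and mean WER.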