How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

Benjamin Akera; Evelyn Nafula; Patrick Walukagga; Gilbert Yiga; John Quinn; Ernest Mwebaze

TL;DR

This work tackles practical data requirements for ASR in low-resource African languages by evaluating Whisper large-v3 on Kinyarwanda (data-scaled from 1 to 1,400 hours) and Kikuyu (270 hours) to identify data-volume thresholds and primary failure modes. The results show that practical WER below $13\%$ is achievable with about 50 hours of data, dropping to around $9.82\%$ at 200 hours and $7.14\%$ with the full data for Kinyarwanda, while Kikuyu reveals a bimodal error distribution with a median WER of $26.3\%$ and a mean of $30.3\%$, including a notable tail of higher errors largely due to noisy ground truth ($38.6\%$ of high-error cases). The study underscores that data quality is as crucial as data volume, with a large share of errors stemming from annotation noise rather than model limitations, and it offers concrete deployment benchmarks and guidance for similar low-resource contexts. All code, models, and datasets are publicly released to support replication and practical adoption in related languages.

Abstract

The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13\%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10\%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6\% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release the accompanying code and models; see https://github.com/SunbirdAI/kinyarwanda-whisper-eval
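
For readers less familiar with the metric, the sketch below illustrates how a word error rate (WER) like the ones quoted above is commonly computed: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. It is a minimal illustration only, not the authors' evaluation code. The function names `wer` and `summarize` and the toy sentences are hypothetical, and the text normalization applied in the paper's evaluation is not stated here, so none is applied.

```python
from statistics import mean, median


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def summarize(pairs):
    """Return (mean, median) of per-utterance WER over (reference, hypothesis) pairs."""
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    return mean(scores), median(scores)


if __name__ == "__main__":
    # Toy placeholder sentences, not data from the paper.
    toy_pairs = [
        ("the cat sat down", "the cat sat down"),
        ("the cat sat down", "the cat sat"),
        ("the cat sat down", "a dog ran up here"),
    ]
    avg, mid = summarize(toy_pairs)
    print(f"mean WER = {avg:.3f}, median WER = {mid:.3f}")
```

Reporting both statistics is useful when a few utterances score very badly, for instance because the reference transcript itself is noisy: the mean is pulled upward by that tail while the median is not, which matches the pattern of a mean above the median reported for Kikuyu. Corpus-level WER is often computed instead by pooling edits and reference words across all utterances, so per-utterance averages and pooled figures should not be compared directly.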
