Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
Sukairaj Hafiz Imam, Tadesse Destaw Belay, Kedir Yassin Husse, Ibrahim Said Ahmad, Idris Abdulmumin, Hadiza Ali Umar, Muhammad Yahuza Bello, Joyce Nakatumba-Nabende, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
TL;DR
This systematic literature review analyzes ASR for African low-resource languages using PRISMA to synthesize 71 studies (2020–2025) across 74 datasets and 111 languages, totaling roughly 11,206 hours of speech. It finds self-supervised and transfer learning approaches promising but constrained by data scarcity, dialectal gaps, and licensing uncertainties, with WER as the dominant evaluation metric and limited use of linguistically informed metrics like CER or DER. The study documents uneven dataset quality, recording conditions, and annotation standards, and highlights community-led initiatives (e.g., Masakhane, Naija Voices) as a pathway toward more inclusive, ethically governed resources. It suggests future directions including standardized benchmarks, multilingual and morpheme-aware modeling, lightweight architectures, and multi-stakeholder collaboration to improve digital inclusion for Africa’s multilingual societies.
Abstract
ASR has achieved remarkable global progress, yet African low-resource languages remain severely underrepresented, creating barriers to digital inclusion across a continent with more than 2,000 languages. This systematic literature review (SLR) examines research on ASR for African languages, focusing on datasets, models and training methods, evaluation techniques, and challenges, and recommends future directions. We follow the PRISMA 2020 guidelines and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models, or metrics for African languages, and exclude non-African, duplicate, and low-quality studies (score <3/5). Of 2,062 records screened, we retain 71 studies, which together document 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of studies provide reproducible materials, and dataset licensing remains unclear. Self-supervised and transfer learning techniques are promising but are hindered by limited pre-training data, inadequate dialect coverage, and limited resource availability. Most studies report Word Error Rate (WER), with minimal use of linguistically informed metrics such as Character Error Rate (CER) or Diacritic Error Rate (DER), which limits how well evaluations apply to tonal and morphologically rich languages. The existing evidence on ASR systems is inconsistent, hindered by limited dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advances indicates a pathway for improvement. Sustainable development in this area will also require stakeholder partnerships, ethically balanced datasets, lightweight modelling techniques, and active benchmarking.
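To illustrate why relying on WER alone can be limiting for morphologically rich languages, the following minimal sketch computes WER and CER from a standard Levenshtein edit distance. It is not taken from any of the reviewed studies: the function names (`edit_distance`, `wer`, `cer`) and the toy reference/hypothesis strings are assumptions made for illustration, and DER is omitted because its exact definition is not given here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Toy example (illustrative strings, not drawn from any reviewed dataset):
# one dropped character inside a long, morphologically complex word.
reference = "ninakupenda sana"
hypothesis = "ninakupeda sana"
print(f"WER = {wer(reference, hypothesis):.4f}")  # 1 wrong word out of 2
print(f"CER = {cer(reference, hypothesis):.4f}")  # 1 wrong character out of 16
```

In this toy case a single missing character makes half of the words count as errors under WER, while CER registers only a small deviation. This is one reason character-level metrics are often reported alongside WER for languages with long, agglutinated word forms.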
