Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo
Audrey Mbogho, Quin Awuor, Andrew Kipkebut, Lilian Wanzare, Vivian Oloo
TL;DR
The paper tackles the scarcity of NLP resources for three under-resourced Kenyan languages (Kidaw'ida, Kalenjin, and Dholuo) through a one-year selective crowdsourcing case study. The project collects text and speech data from native speakers, builds parallel corpora with Kiswahili, and releases the resources publicly, with the parallel text on Zenodo and the speech data on Mozilla Common Voice, so that developers can train models and build NLP applications. The approach shows how targeted community engagement and open data can begin to close resource gaps, producing 30,000 sentences per language and substantial speech datasets despite modest funding. The work highlights practical implications for inclusive AI in Africa, with an emphasis on health, education, and development applications, and outlines challenges around sustainability, licensing, and scaling data collection. It also serves as a blueprint for similar initiatives across Africa and encourages continued collaboration from native speakers and developers.
Abstract
Natural language processing is a crucial frontier in artificial intelligence, with broad applications in areas including public health, agriculture, education, and commerce. However, because they lack substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowdsourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translating the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and giving developers access to train models and build natural language processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital for promoting linguistic diversity and empowering local communities by enabling natural language processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and use these corpora.
