Swivuriso: The South African Next Voices Multilingual Speech Dataset
Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga
TL;DR
Swivuriso addresses the scarcity of large-scale, diverse ASR resources for South African languages with a 3000-hour multilingual corpus spanning seven languages and three domains (agriculture, healthcare, and general). The dataset combines scripted and unscripted speech gathered through participatory, ethically governed collection, with careful prompt design, processing, and privacy protections. Baseline experiments in cross-corpus, monolingual, and multilingual settings show consistent reductions in word error rate when state-of-the-art ASR models are trained or fine-tuned on Swivuriso, and show how performance varies with the volume of training data. The work establishes Swivuriso as a reusable, ethically grounded resource that can accelerate inclusive speech technology development for underrepresented African languages and inform governance models for future datasets.
Abstract
This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with results on other ASR datasets for the languages concerned.
