Swivuriso: The South African Next Voices Multilingual Speech Dataset

Vukosi Marivatee, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga

TL;DR

Swivuriso addresses the paucity of large-scale, diverse ASR resources for South African languages by delivering a 3000-hour, multilingual corpus spanning seven languages and three domains (agriculture, healthcare, general). The dataset combines scripted and unscripted speech produced through participatory, ethically governed collection, with careful prompt design, processing, and privacy protections. Baseline experiments across cross-corpus, monolingual, and multilingual settings demonstrate consistent improvements in word error rate when training or fine-tuning state-of-the-art ASR models on Swivuriso, and show data-volume effects on performance. The work establishes Swivuriso as a reusable, ethically grounded resource that can accelerate inclusive speech technology development for underrepresented African languages and inform governance models for future datasets.

Abstract

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with results on other ASR datasets for the languages concerned.

Paper Structure

This paper contains 22 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Geographic distribution of Swivuriso across South African provinces. Each province is shaded according to the total hours of recorded speech. The labels display the province name, with the dominant language indicated in brackets.
  • Figure 2: Overview of Swivuriso creation workflow.
  • Figure 3: Histogram showing how many clips each speaker contributed in the train, dev, and devtest splits.
  • Figure 4: Distribution of clip durations across train, dev, and devtest splits. All splits display a highly similar distribution pattern, with the majority of clips concentrated between 10 and 30 seconds.
  • Figure 5: Distribution of word count per clip by speech type (scripted vs. unscripted) across four South African languages. Each subplot displays histograms and KDE curves overlaid for both types, with vertical dashed lines marking median values.
  • ...and 1 more figures