1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Sewade Ogun, Abraham T. Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin Adewumi
TL;DR
This paper tackles the under-representation of African voices in speech synthesis by introducing Afro-TTS, a pan-African accented English TTS system capable of generating speech in 86 African accents with a large, diverse set of personas. Afro-TTS is built on a crowdsourced dataset of 747 speakers across 9 countries (136 hours) and uses fine-tuned VITS and XTTS models, augmented with speaker interpolation to create 200+ additional voices. An extensive evaluation shows that XTTS-FT achieves high subjective naturalness and accentedness, highlights regional diversity, and supports the viability of interpolated voices. The work advances practical African voice representation in TTS, with implications for education, public health, and content creation, and notes limitations related to accent balance and privacy.
Abstract
Recent advances in speech synthesis have enabled many useful applications, such as audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are dominated by voices sourced from data-rich geographies, with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African voices and personas are under-represented in these systems. As speech synthesis becomes increasingly democratized, it is desirable to increase the representation of African English accents. We present Afro-TTS, the first pan-African accented English speech synthesis system able to generate speech in 86 African accents, with 1000 personas representing the rich phonological diversity across the continent for downstream application in Education, Public Health, and Automated Content Creation. Speaker interpolation retains naturalness and accentedness, enabling the creation of new voices.
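
The speaker interpolation mentioned above can be illustrated with a minimal sketch. This section does not specify how the authors interpolate, so the example assumes the simplest form: a linear blend of two fixed-length speaker embedding vectors. The function name `interpolate_speakers`, the mixing weight `alpha`, and the toy 4-dimensional embeddings are hypothetical and not taken from the paper.

```python
def interpolate_speakers(emb_a, emb_b, alpha=0.5):
    """Linearly blend two speaker embeddings (assumed technique, not the paper's confirmed method).

    alpha=0.0 returns emb_a unchanged, alpha=1.0 returns emb_b unchanged.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Element-wise convex combination of the two embedding vectors.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


# Toy embeddings standing in for two real speakers (hypothetical values).
speaker_a = [0.0, 2.0, -1.0, 4.0]
speaker_b = [4.0, 6.0, 1.0, 0.0]

# A new "voice" halfway between the two speakers.
new_voice = interpolate_speakers(speaker_a, speaker_b, alpha=0.5)
```

In a real system the blended vector would be passed to the TTS model in place of a recorded speaker's embedding. Whether the result stays natural and accented is an empirical question, which is what the evaluation summarized above addresses.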
