AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents
Abraham Toluwase Owodunni, Aditya Yadavalli, Chris Chinenye Emezue, Tobi Olatunji, Clinton C Mbataku
TL;DR
AccentFold tackles zero-shot ASR adaptation for diverse African English accents by learning accent embeddings via multitask learning on the XLS-R backbone using AfriSpeech-200. The paper analyzes the geometry of the accent space with t-SNE, uncovering clusters that follow language family and geography and revealing relationships not fully captured by Ethnologue. Empirically, selecting fine-tuning data based on AccentFold embeddings reduces WER on 41 OOD accents compared with random and geography-based baselines (the abstract reports a 4.6% relative improvement), demonstrating data-efficient gains in robust accented ASR. The findings suggest that exploiting linguistic and geographic structure in accent representations can improve cross-accent ASR performance, with implications for resource-constrained, multilingual contexts. The work also points to potential refinements in linguistic classifications (e.g., Kwa-Bantu, Niger-Congo subfamilies) informed by empirical speech-derived embeddings.
Abstract
Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships that highlight geographic and genealogical similarities and capture consistent phonological and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines, with a relative WER improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents.
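The subset-sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each accent, including the target, already has an embedding vector, and that closeness is measured with cosine similarity; the abstract does not specify the metric. The accent names, vectors, and WER values are hypothetical toy data, and the function names are introduced only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_accents(target, embeddings, k):
    """Return the k accents whose embeddings are most similar to the target's.

    Assumption: cosine similarity stands in for whatever distance the
    paper actually uses to compare accent embeddings.
    """
    target_vec = embeddings[target]
    scored = [
        (cosine_similarity(target_vec, vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER improvement: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical 2-D embeddings; real accent embeddings would be learned
# from speech and have far more dimensions.
toy_embeddings = {
    "accent_a": [1.0, 0.0],
    "accent_b": [0.9, 0.1],
    "accent_c": [0.0, 1.0],
    "accent_d": [0.7, 0.7],
}

# Accents whose training data would be sampled for target "accent_a".
selected = nearest_accents("accent_a", toy_embeddings, k=2)
print(selected)

# Toy WER values, not results from the paper.
print(relative_wer_improvement(0.20, 0.18))
```

In this sketch, speech from the selected accents would form the fine-tuning subset for the unseen target accent, and the relative improvement is the same kind of quantity as the 4.6% figure quoted in the abstract.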
