AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents

Abraham Toluwase Owodunni, Aditya Yadavalli, Chris Chinenye Emezue, Tobi Olatunji, Clinton C Mbataku

TL;DR

AccentFold tackles zero-shot ASR adaptation for diverse African English accents by learning accent embeddings via multitask learning on the XLS-R backbone using Afrispeech-200. The paper analyzes the geometry of the accent space with t-SNE, finding that accents cluster by language family and geography and surfacing relationships not fully captured by Ethnologue. Empirically, selecting fine-tuning data based on AccentFold embeddings lowers WER on 41 OOD accents compared with random and geography-based baselines (the abstract reports a relative WER improvement of 4.6%), a data-efficient gain for accented ASR. The findings suggest that exploiting linguistic and geographic structure in accent representations can improve cross-accent ASR performance, with implications for resource-constrained, multilingual contexts. The work also points to potential refinements in linguistic classifications (e.g., Kwa-Bantu, Niger-Congo subfamilies) informed by empirical speech-derived embeddings.

Abstract

Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships highlighting geographic and genealogical similarities, capturing consistent phonological and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines, with a relative WER improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents.
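
To make the idea of sampling accent subsets from embedding-space neighbors concrete, the sketch below ranks candidate training accents by their similarity to a target accent and keeps the closest few. It is an illustration under stated assumptions, not the authors' procedure: the use of cosine similarity, the function names, the accent labels, and the toy 2-dimensional vectors are all hypothetical, since this section does not specify the distance metric or the selection details. A small helper also shows how a relative WER improvement such as the reported 4.6% is conventionally computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_accents(target_embedding, candidate_embeddings, k):
    """Return the names of the k candidates most similar to the target.

    candidate_embeddings maps an accent name to its embedding vector.
    """
    ranked = sorted(
        candidate_embeddings,
        key=lambda name: cosine_similarity(
            target_embedding, candidate_embeddings[name]
        ),
        reverse=True,
    )
    return ranked[:k]


def relative_wer_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy, made-up embeddings; real accent embeddings would come from the model.
candidates = {
    "accent_a": [0.9, 0.1],
    "accent_b": [0.0, 1.0],
    "accent_c": [0.7, 0.7],
}
selected = nearest_accents([1.0, 0.0], candidates, k=2)
print(selected)
```

Speech from the selected accents would then form the fine-tuning subset for the unseen target accent. A random-sampling baseline would instead draw the same number of accents without consulting the embeddings.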

Paper Structure (24 sections, 2 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Venn diagram of the accent splits
  • Figure 2: t-SNE visualization of the learned accent embeddings in AccentFold: embeddings of the entire Afrispeech-200 data. In this figure, each accent is encoded with one color. We use the color transparency to differentiate the accents, while the color categories represent the geographical region.
  • Figure 3: t-SNE visualization of embeddings by country from the Afrispeech test split.
  • Figure 4: Analysis of Dual Accents
  • Figure 5: Test WER across all 41 OOD accents. We compare AccentFold with random sampling.
  • ...and 6 more figures