Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context

Antoine Caubrière, Elodie Gauthier

TL;DR

This work addresses the underrepresentation of sub-Saharan African languages in self-supervised speech models by training an Africa-centric SSL model exclusively on SSA speech. Using a HuBERT-base architecture, the authors pretrain on nearly 60,000 hours across 21 SSA languages and evaluate on the SSA subset of FLEURS, achieving competitive ASR results with substantially less data and fewer parameters than a strong baseline. The SSA-focused model also yields a marked improvement in language identification, surpassing FLEURS baselines by over 22 percentage points. The study provides an open-source SSA SSL resource and demonstrates the practical impact of region-specific pretraining for improving robustness and performance in SSA ASR and LID tasks.

Abstract

We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60 000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa. On the SSA subset of the FLEURS-102 dataset, our approach, based on a HuBERT$_{base}$ (0.09B) architecture, shows competitive results on the ASR downstream task compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient: it uses 7x less data and 6x fewer parameters. Furthermore, on the LID downstream task, our approach outperforms the accuracy of the FLEURS baselines by over 22\%.
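The efficiency figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the function and variable names are illustrative, and the baseline's pre-training hours are not stated in the abstract, so the value computed here is only what the "7x" ratio implies when "nearly 60 000 hours" is rounded up to 60 000.

```python
# Figures quoted in the abstract (parameter counts in billions).
SSA_PARAMS_B = 0.09        # HuBERT base
BASELINE_PARAMS_B = 0.6    # w2v-bert-51
SSA_HOURS = 60_000         # "nearly 60 000 hours", used as an upper bound
DATA_RATIO = 7             # "7x less data"


def param_ratio(baseline_b: float, model_b: float) -> float:
    """How many times larger the baseline is than the model."""
    return baseline_b / model_b


def implied_baseline_hours(model_hours: int, ratio: int) -> int:
    """Baseline pre-training hours implied by the quoted data ratio.

    This is an inference from the abstract's figures, not a number
    reported in the abstract itself.
    """
    return model_hours * ratio


ratio = param_ratio(BASELINE_PARAMS_B, SSA_PARAMS_B)
hours = implied_baseline_hours(SSA_HOURS, DATA_RATIO)

print(f"parameter ratio: {ratio:.1f}x")
print(f"implied baseline pre-training data: about {hours:,} hours")
```

The parameter ratio comes out between 6 and 7, which is consistent with the abstract's rounded "6x" claim.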

Paper Structure (8 sections, 4 tables)