Towards Robust Speech Representation Learning for Thousands of Languages

William Chen; Wangyou Zhang; Yifan Peng; Xinjian Li; Jinchuan Tian; Jiatong Shi; Xuankai Chang; Soumi Maiti; Karen Livescu; Shinji Watanabe

TL;DR

XEUS tackles the challenge of speech representation for thousands of languages by scaling SSL pre-training to more than $10^6$ hours and $4{,}057$ languages. It introduces a novel acoustic dereverberation objective and uses an open, large-scale pre-training dataset; the architecture is a $19$-layer E-Branchformer trained with HuBERT-style masked prediction and WavLM-style denoising. It achieves state-of-the-art performance on ML-SUPERB and shows competitive results on FLEURS and resynthesis, with notable gains on long-tail languages. This work advances practical universal speech representations and promotes reproducibility through public data, code, and intermediate checkpoints.

Abstract

Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/.

Paper Structure (38 sections, 1 equation, 3 figures, 11 tables, 1 algorithm)

Introduction
Motivation and Related Work
Speech Representation Learning
Robust Speech Representations
Open Foundation Models
Data
Existing Datasets
MMS-unlab v2
WikiTongues
Jesus Dramas
Final Pre-Training Corpus
Self-Supervised Pre-Training
Masked Prediction and Denoising
Dereverberation
Model Architecture
...and 23 more sections

Figures (3)

Figure 1: Distribution of XEUS pre-training data by language (log scale). We exclude data from YODAS due to the noisiness of its language labels.
Figure 2: Overview of XEUS' pre-training. The teacher encoder generates phonetic pseudo-labels from clean speech, while the student must predict those pseudo-labels after masking, random noise, and/or reverberation are applied to the input waveform.
Figure 3: Distribution of data between the 189 language families in the XEUS pre-training data. We use Glottolog (https://glottolog.org/) to automatically map each ISO3 code to a language family.
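
The pre-training setup described in the Figure 2 caption can be illustrated with a small, standard-library-only Python sketch. It is not the authors' implementation: the toy room impulse response, the application probabilities, the 10 dB SNR, the span-masking scheme, and every function name here are assumptions chosen for illustration. The sketch corrupts a clean waveform with optional reverberation and noise to form the student input, then scores the student's predictions against teacher pseudo-labels on masked frames only (an assumption borrowed from HuBERT-style training).

```python
import math
import random


def synthetic_rir(length, decay, rng):
    """Toy room impulse response (hypothetical): a direct path of 1.0
    followed by exponentially decaying random reflections."""
    return [1.0] + [rng.uniform(-1.0, 1.0) * math.exp(-decay * n)
                    for n in range(1, length)]


def reverberate(wave, rir):
    """Convolve the waveform with the impulse response, truncated to the input length."""
    out = [0.0] * len(wave)
    for i, x in enumerate(wave):
        for j, h in enumerate(rir):
            if i + j < len(wave):
                out[i + j] += x * h
    return out


def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    signal_power = sum(x * x for x in wave) / len(wave)
    sigma = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, sigma) for x in wave]


def sample_mask(num_frames, mask_prob, span, rng):
    """Pick random span starts and mask `span` consecutive frames from each."""
    masked = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, num_frames)):
                masked[t] = True
    return masked


def make_student_input(clean_wave, rng, p_reverb=0.5, p_noise=0.5):
    """Corrupt the clean waveform; the teacher would still see `clean_wave`."""
    wave = list(clean_wave)
    if rng.random() < p_reverb:
        wave = reverberate(wave, synthetic_rir(64, 0.1, rng))
    if rng.random() < p_noise:
        wave = add_noise(wave, snr_db=10.0, rng=rng)
    return wave


def masked_prediction_loss(log_probs, pseudo_labels, masked):
    """Mean negative log-likelihood of the teacher pseudo-labels on masked frames."""
    terms = [-log_probs[t][pseudo_labels[t]]
             for t in range(len(pseudo_labels)) if masked[t]]
    return sum(terms) / len(terms) if terms else 0.0
```

Because the pseudo-labels come from the clean signal while the student sees the corrupted one, minimizing this loss implicitly asks the student to undo the reverberation and noise, which is the intuition behind the dereverberation objective.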
