ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe

TL;DR

ML-SUPERB addresses the English-centric limitation of the original SUPERB by introducing a multilingual benchmark that covers 143 languages for automatic speech recognition (ASR) and language identification. It adopts the same frozen-SSL feature extraction and lightweight downstream framework as SUPERB, which allows efficient evaluation across monolingual and multilingual tracks with small training subsets. The key findings are that SSL models clearly improve on FBANK features and that multilingual SSL models can outperform monolingual ones (XLSR-128 often leads), but neither larger nor multilingual models universally outperform their monolingual counterparts. The work provides an open, reproducible evaluation framework and datasets for future multilingual representation learning and benchmarking, with a public portal for ongoing participation.

Abstract

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. However, SUPERB largely considers English speech in its evaluation. This paper presents multilingual SUPERB (ML-SUPERB), covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Following the concept of SUPERB, ML-SUPERB utilizes frozen SSL features and employs a simple framework for multilingual tasks by learning a shallow downstream model. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features. Furthermore, we find that multilingual models do not always perform better than their monolingual counterparts. We will release ML-SUPERB as a challenge with organized datasets and reproducible training scripts for future multilingual representation research.
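
The frozen-feature setup described in the abstract can be illustrated with a small sketch. It assumes the SUPERB-style convention of combining the hidden layers of a frozen SSL model through a softmax-normalized weighted sum before a shallow downstream model; the abstract itself does not spell out this step. The function names, the toy feature values, and the pure-Python nested-list representation are hypothetical choices made for illustration, not the benchmark's actual code.

```python
import math


def softmax(scores):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_feats, layer_scores):
    """Fuse per-layer features from a frozen SSL model.

    layer_feats: list of L layers, each a list of T frames,
                 each frame a list of D floats (hypothetical layout).
    layer_scores: list of L raw scores; in training these would be the
                  only learnable parameters on the feature side.
    Returns a T x D list of fused frame features.
    """
    weights = softmax(layer_scores)
    num_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_feats):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    return fused


# Toy example: 2 layers, 1 frame, 2 feature dimensions.
toy_feats = [[[1.0, 2.0]], [[3.0, 4.0]]]
toy_scores = [0.0, 0.0]  # equal scores give a plain average of the layers
fused = weighted_layer_sum(toy_feats, toy_scores)
```

In this sketch the SSL model's own parameters are never touched; only the per-layer scores and the downstream model would be trained. Inspecting the learned weights after training is one way a layerwise weight analysis could be read off such a setup.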

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Layerwise weight analysis of the XLSR-128 model in the monolingual track.