ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

Jiatong Shi; Shih-Heng Wang; William Chen; Martijn Bartelds; Vanya Bannihatti Kumar; Jinchuan Tian; Xuankai Chang; Dan Jurafsky; Karen Livescu; Hung-yi Lee; Shinji Watanabe

TL;DR

ML-SUPERB 2.0 revisits multilingual speech benchmarking by relaxing ML-SUPERB's fixed-downstream constraint and introducing larger downstream models, fine-tuning options, adapters/LoRA, and supervised pre-trained models. It evaluates on a broad corpus (≈300 hours, 142 languages, 15 datasets) using ESPnet, with layer-aggregated outputs from SSL and supervised encoders and a 100M parameter cap, so that diverse configurations can be compared fairly. The study finds no universally superior architecture: E-Branchformer generally performs well, middle-layer fine-tuning and LoRA adapters give strong results, and supervised pre-trained models do not consistently outperform SSL baselines, especially under limited data. The results also show large variability across languages and datasets, motivating targeted multilingual representations and more nuanced benchmarking to support robust, real-world deployment.
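To make "layer-aggregated" and "parameter cap" concrete, the following is a minimal sketch and not the benchmark's ESPnet implementation. It assumes the aggregation is a softmax-weighted sum over encoder layers (a common SUPERB-style choice that this summary does not confirm) and that the cap is checked against a count of tunable parameters. The names `aggregate_layers`, `lora_param_count`, and `PARAM_CAP` are hypothetical.

```python
import math

# Assumed budget taken from the "100M parameter cap" in the summary above.
PARAM_CAP = 100_000_000


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, layer_scores):
    """Mix encoder layers into one feature sequence.

    layer_outputs: list of layers, each a list of frames, each a list of floats.
    layer_scores:  one (learnable, in a real model) scalar per layer.
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]
    return mixed


def lora_param_count(d_in, d_out, rank):
    """Tunable parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def within_cap(tunable_params, cap=PARAM_CAP):
    """True if a configuration stays inside the assumed parameter budget."""
    return tunable_params <= cap
```

With equal layer scores the mixture reduces to a plain average of the layers. The LoRA count illustrates why adapters fit easily under such a budget: a rank-8 pair on a 768-by-768 projection adds only 12,288 tunable parameters.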

Abstract

ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.
