ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

TL;DR

ML-SUPERB 2.0 revisits multilingual speech benchmarking by relaxing ML-SUPERB's fixed-downstream constraint and introducing larger downstream models, fine-tuning options, adapters/LoRA, and supervised pre-trained models. It evaluates on a broad corpus (≈300 hours, 142 languages, 15 datasets) using ESPnet, with layer-aggregated outputs from SSL and supervised encoders and a 100M parameter cap that allows diverse configurations while keeping comparisons fair. The study finds no universally superior architecture: E-Branchformer generally performs well, middle-layer fine-tuning and LoRA adapters offer strong results, and supervised pre-trained models do not consistently outperform SSL baselines, especially under limited data. The results reveal large variability across languages and datasets, motivating targeted multilingual representations and more nuanced benchmarking to support robust, real-world deployment.

Abstract

ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.
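
The TL;DR refers to layer-aggregated encoder outputs, and the abstract describes the pre-trained models as feature extractors feeding a downstream model. One common way to do this in SUPERB-style benchmarks is a softmax-weighted sum over the encoder's layer outputs. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not the paper's code: the function names (`softmax`, `aggregate_layers`) and the toy data are hypothetical, and in a real system the layer logits would be learned jointly with the downstream model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(layer_outputs, layer_logits):
    """Combine per-layer features with a softmax-weighted sum (hypothetical helper).

    layer_outputs: list of L layers, each a list of T frames, each a list of D floats.
    layer_logits:  list of L unnormalized weights (fixed here, learned in practice).
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for w, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += w * layer[t][d]
    return mixed

# Toy input (made up): 2 layers, 2 frames, 3 feature dimensions.
layers = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
]
# Equal logits give equal weights, so the result is the plain average of the layers.
features = aggregate_layers(layers, [0.0, 0.0])
```

With equal logits the aggregate is simply the mean across layers. Raising one layer's logit shifts the mixture toward that layer, which is how a downstream model can come to favor, for example, middle layers.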

Paper Structure (19 sections, 4 tables)