Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

Yonggan Fu; Yang Zhang; Kaizhi Qian; Zhifan Ye; Zhongzhi Yu; Cheng-I Lai; Yingyan Celine Lin

TL;DR

S$^3$-Router introduces a mask-based finetuning paradigm that applies language-/task-specific binary masks to shared self-supervised speech model weights, enabling up to 10% weight pruning without compromising performance. The framework yields a versatile, scalable solution for multilingual and multitask speech processing, improves efficiency via sparsity-driven pruning, and helps analyze what SSL models encode through learned masks. Empirical results across LibriSpeech, CommonVoice, and SUPERB tasks show improvements in WER and PER over standard weight finetuning, with strong cross-lingual transfer and robustness to larger SSL backbones. The approach offers practical on-device deployment potential and provides a data-driven lens into the structure of speech SSL representations, with open-source code to reproduce and extend the work.

Abstract

Self-supervised learning (SSL) of rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks. By reducing the need for large amounts of transcribed speech, it has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which conflicts with limited on-device resources. This gap can be even more severe in multilingual/multitask scenarios that require recognizing multiple languages or executing multiple speech processing tasks simultaneously. Additionally, strongly overparameterized speech SSL models tend to overfit when finetuned on low-resource speech corpora. This work aims to make speech SSL models more practical, both improving efficiency and alleviating overfitting, via our proposed S$^3$-Router framework. It shows for the first time that simply discarding no more than 10\% of model weights, by finetuning only the model connections of speech SSL models, can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique that enables (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router provides a new perspective for practical deployment of speech SSL models. Our code is available at: https://github.com/GATECH-EIC/S3-Router.
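To make the idea of "finetuning model connections" concrete, the following is a minimal sketch of how one frozen set of shared weights could be routed through language-specific binary masks. It is illustrative only: the function names, the toy weight vector, and the hand-set per-weight scores are all hypothetical, and deriving each mask by dropping the lowest-scoring entries is an assumption made for this example rather than a description of the released code. In practice the scores would be learned on downstream data, not fixed by hand.

```python
# Illustrative sketch only: hypothetical names and toy values,
# not taken from the S3-Router codebase.

def topk_binary_mask(scores, sparsity):
    """Return a 0/1 mask that discards the `sparsity` fraction of
    entries with the lowest scores and keeps the rest."""
    n_drop = int(len(scores) * sparsity)
    # Indices sorted by ascending score; the first n_drop are discarded.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [0 if i in dropped else 1 for i in range(len(scores))]


def apply_mask(weights, mask):
    """Element-wise product of shared weights and a binary mask."""
    return [w * m for w, m in zip(weights, mask)]


# One shared, frozen weight vector (stands in for the SSL model weights).
shared_weights = [0.5, -1.2, 0.3, 0.8, -0.1, 0.9, -0.7, 0.4, 1.1, -0.6]

# Hypothetical per-language scores; these would be the only trained quantities.
scores = {
    "en": [0.9, 0.1, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 1.0],
    "zh": [0.2, 0.9, 0.8, 0.7, 0.05, 0.5, 0.4, 0.3, 0.6, 1.0],
}

# Discard 10% of the weights per language, keeping the weight values untouched.
masks = {lang: topk_binary_mask(s, sparsity=0.1) for lang, s in scores.items()}
routed = {lang: apply_mask(shared_weights, m) for lang, m in masks.items()}

for lang in masks:
    print(lang, masks[lang])
```

Under these assumptions, each additional language or task costs only one binary mask on top of the shared weights, which is what makes the multilingual/multitask setting cheap to store.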

Paper Structure (20 sections, 1 equation, 6 figures, 9 tables)

Introduction
Related Work
The Proposed S$^3$-Router Framework
Drawn Inspirations from Previous Work
Formulation and Optimization of S$^3$-Router
How to Initialize the Masks in S$^3$-Router?
S$^3$-Router is Useful in Various Application Scenarios
S$^3$-Router: Discarding $\leq$10% Weights is All You Need
Experiment Setup
Benchmark on Low-resource English ASR
Benchmark on Low-resource Cross-lingual Transfer
Benchmark on More Downstream Speech Processing Tasks
Empowering Multilingual and Multitask Speech Processing
Benchmark S$^3$-Router with Adaptor Tuning
S$^3$-Router-P: Pruning ASR Models for Enhancing Efficiency
...and 5 more sections

Figures (6)

Figure 1: An overview of our S$^3$-Router framework, which receives multilingual speech signals, denoted here as A, B, and C, and outputs the corresponding predicted text transcripts, based on one shared weight model together with language-/task-specific binary masks.
Figure 2: Benchmark our S$^3$-Router and standard weight finetuning on the test-clean/test-other sets of LibriSpeech on top of wav2vec2-base/large under different low-resource settings.
Figure 3: Benchmark our S$^3$-Router and weight finetuning on xlsr across 10 spoken languages.
Figure 4: Benchmark our S$^3$-Router-P against OMP, IMP, and PARP (Lai et al., 2021) for pruning wav2vec2-base on LibriSpeech. The WER on the test-clean set is reported.
Figure 5: Benchmark our S$^3$-Router-P against OMP and PARP for pruning on Mandarin.
...and 1 more figures
