Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Yonggan Fu, Yang Zhang, Kaizhi Qian, Zhifan Ye, Zhongzhi Yu, Cheng-I Lai, Yingyan Celine Lin
TL;DR
S$^3$-Router introduces a mask-based finetuning paradigm that applies language-/task-specific binary masks to shared self-supervised speech model weights, enabling up to 10% weight pruning without compromising performance. The framework yields a versatile, scalable solution for multilingual and multitask speech processing, improves efficiency via sparsity-driven pruning, and helps analyze what SSL models encode through learned masks. Empirical results across LibriSpeech, CommonVoice, and SUPERB tasks show improvements in WER and PER over standard weight finetuning, with strong cross-lingual transfer and robustness to larger SSL backbones. The approach offers practical on-device deployment potential and provides a data-driven lens into the structure of speech SSL representations, with open-source code to reproduce and extend the work.
Abstract
Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which can mitigate the necessity of a large amount of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which contradicts the limited on-device resources. This gap could be more severe in multilingual/multitask scenarios requiring simultaneously recognizing multiple languages or executing multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to suffer from overfitting when being finetuned on low-resource speech corpus. This work aims to enhance the practical usage of speech SSL models towards a win-win in both enhanced efficiency and alleviated overfitting via our proposed S$^3$-Router framework, which for the first time discovers that simply discarding no more than 10\% of model weights via only finetuning model connections of speech SSL models can achieve better accuracy over standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router has provided a new perspective for practical deployment of speech SSL models. Our codes are available at: https://github.com/GATECH-EIC/S3-Router.
