MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
TL;DR
MS-HuBERT addresses pre-training/inference mismatch in MLM-based speech representation learning by introducing Swap, which exposes full context during pre-training, and Multicluster MPL, which leverages multiple cluster resolutions to better utilize model capacity. Built on HuBERT’s architecture, it demonstrates improved ASR performance on Librispeech, especially in low-resource settings, and matches data2vec in high-resource scenarios, while also delivering strong content-based task performance on SUPERB. The embeddings learned during pre-training encode substantial information useful for downstream tasks, validating the approach's effectiveness and efficiency. The work highlights a path to closer integration between pre-training objectives and real-world inference, with practical implications for robust speech representations and downstream NLP- or ASR-related applications.
Abstract
In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It outperforms vanilla HuBERT on the Librispeech ASR benchmark by a 5% margin on average across different fine-tuning splits. Additionally, we demonstrate that the embeddings learned during pre-training encode essential information for improving the performance of content-based tasks such as ASR.
