MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah

TL;DR

MS-HuBERT addresses pre-training/inference mismatch in MLM-based speech representation learning by introducing Swap, which exposes full context during pre-training, and Multicluster MPL, which leverages multiple cluster resolutions to better utilize model capacity. Built on HuBERT’s architecture, it demonstrates improved ASR performance on Librispeech, especially in low-resource settings, and matches data2vec in high-resource scenarios, while also delivering strong content-based task performance on SUPERB. The embeddings learned during pre-training encode substantial information useful for downstream tasks, validating the approach's effectiveness and efficiency. The work highlights a path to closer integration between pre-training objectives and real-world inference, with practical implications for robust speech representations and downstream NLP- or ASR-related applications.

Abstract

In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the LibriSpeech ASR benchmark by a 5% margin on average when evaluated on different fine-tuning splits. Additionally, we demonstrate that the embeddings learned during pre-training encode essential information for improving the performance of content-based tasks such as ASR.
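The idea of a masked prediction loss computed at several cluster resolutions can be illustrated with a minimal sketch. It assumes one prediction head per cluster resolution, cross-entropy restricted to masked frames, and an unweighted sum across heads; the function names, the toy cluster counts, and the summation rule are illustrative assumptions, not details taken from the paper.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked frames only.

    logits:  list of T rows, each a list of K floats (one score per cluster)
    targets: list of T integer cluster ids in [0, K)
    mask:    list of T booleans, True where the input frame was masked
    """
    losses = []
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses) if losses else 0.0


def multicluster_mpl(logits_per_head, targets_per_head, mask):
    """Sum the masked loss over heads, one head per cluster resolution.

    Assumption: heads are weighted equally; the actual weighting is not
    stated in this summary.
    """
    return sum(
        masked_cross_entropy(logits, targets, mask)
        for logits, targets in zip(logits_per_head, targets_per_head)
    )


# Toy usage: 3 frames, two heads with 4 and 2 clusters (hypothetical sizes).
mask = [True, False, True]
coarse = [[0.0] * 2 for _ in range(3)]
fine = [[0.0] * 4 for _ in range(3)]
loss = multicluster_mpl([fine, coarse], [[0, 1, 2], [1, 0, 1]], mask)
```

With uniform logits, each head contributes the log of its cluster count, so the toy loss equals log 4 + log 2. This makes it easy to check that only masked frames are counted and that each resolution adds its own term.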

Paper Structure (18 sections, 8 figures, 6 tables)

This paper contains 18 sections, 8 figures, 6 tables.

Introduction
Method
Background
MS-HuBERT
Swap
Multicluster MPL
Experimental Details
Results
Main Results: Supervised Fine-tuning and Inference
MS-HuBERT as a Feature Extractor
3rd Iteration Models
Evaluation of Individual Layers on the SUPERB Benchmark
Discussion
Conclusion and Future Work
Limitations
...and 3 more sections

Figures (8)

Figure 1: Proposed MS-HuBERT approach, an end-to-end self-supervised pre-training method to learn robust speech representations. The input raw audio is passed to a CNN encoder. Two copies of its output are created, one masked and one unmasked, and both are passed through the Swap-modified second encoder. The Multicluster masked prediction loss is calculated, on masked indices only, from the output embeddings of different blocks of the modified second encoder.
Figure 2: Solid lines show the CCA similarity with the word labels. Dotted lines show the area under the curve (AUC) for different models.
Figure 3: CCA similarity with the word labels for MS-HuBERT and its variants. The S-HuBERT curve is similar to WavLM.
Figure 4: PER values are plotted with the x-axis representing layers 5 to 12, ordered from left to right.
Figure 5: WER values on the y-axis should be divided by 100 to obtain the final WER. The x-axis represents layers from 5 to 12, ordered from left to right.
...and 3 more figures
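
The data flow described in the Figure 1 caption can be sketched as follows. The caption states only that a masked and an unmasked copy of the CNN output pass through a Swap-modified encoder. This sketch assumes that, after every block, the two views exchange their frames at the masked positions. That swap rule, the helper names, and the choice to return both views per block are illustrative assumptions rather than the authors' implementation.

```python
def apply_mask(frames, mask, mask_embedding):
    """Copy frames, replacing masked positions with the mask embedding."""
    return [list(mask_embedding) if m else list(f) for f, m in zip(frames, mask)]


def swap_masked(view_a, view_b, mask):
    """Exchange the frames at masked positions between two views."""
    new_a = [b if m else a for a, b, m in zip(view_a, view_b, mask)]
    new_b = [a if m else b for a, b, m in zip(view_a, view_b, mask)]
    return new_a, new_b


def encode_with_swap(frames, mask, mask_embedding, blocks):
    """Run both views through every block, swapping after each one.

    blocks: list of callables mapping a sequence of frames to a sequence
            of frames (stand-ins for encoder blocks).
    Returns one (masked_view, unmasked_view) pair per block, so a loss
    could be attached to the outputs of several blocks.
    """
    masked = apply_mask(frames, mask, mask_embedding)
    unmasked = [list(f) for f in frames]
    per_block = []
    for block in blocks:
        masked, unmasked = block(masked), block(unmasked)
        # Assumed swap rule: exchange masked positions between the views.
        masked, unmasked = swap_masked(masked, unmasked, mask)
        per_block.append((masked, unmasked))
    return per_block


# Toy usage with identity blocks, to make the swap itself visible.
identity = lambda seq: seq
outputs = encode_with_swap(
    frames=[[1.0], [2.0], [3.0]],
    mask=[False, True, False],
    mask_embedding=[0.0],
    blocks=[identity, identity],
)
```

With identity blocks, the first swap moves the true frame at the masked position into the masked view, and the second swap moves it back. In a real encoder each block would transform the frames between swaps, which is how the masked view gets exposed to full-context embeddings during pre-training.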
