MelHuBERT: A simplified HuBERT on Mel spectrograms
Tzu-Quan Lin, Hung-yi Lee, Hao Tang
TL;DR
This work tackles the high compute barrier of self-supervised speech learning by rethinking HuBERT's architecture and training. It replaces the waveform-based front-end with Mel spectrogram inputs, uses a simple cross-entropy loss over k-means centroids, and employs staged pre-training, including a second stage guided by phone-relevant targets. Empirically, MelHuBERT achieves competitive phone recognition, speaker identification, and ASR performance while reducing pre-training time by 31.2% and multiply-accumulate operations (MACs) per second of speech by 33.5% on the 360-hour LibriSpeech subset; even stronger gains appear in low-resource settings. The paper also analyzes the learned representations in detail, showing how MelHuBERT and HuBERT differ and where each has strengths, and releases code and models to enable broader access to efficient self-supervised speech learning.
Abstract
Self-supervised models have had great success in learning speech representations that generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, which significantly hampers the development of self-supervised learning. To reduce the computation of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, the input representation, and multi-stage training. Our model, MelHuBERT, compares favorably with HuBERT on phone recognition, speaker identification, and automatic speech recognition, while saving 31.2% of the pre-training time, or equivalently 33.5% of the MACs per second of speech. The code and pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
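As a rough illustration of the cross-entropy objective over k-means targets mentioned in the TL;DR, the sketch below computes the mean cross-entropy between per-frame logits and cluster IDs. It is not taken from the released code: the function names, the toy values, and the choice to average only over masked frames are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_cross_entropy(frame_logits, cluster_ids, mask):
    """Mean cross-entropy between per-frame logits and k-means cluster IDs.

    Assumption for this sketch: only frames with mask == True contribute.
    """
    losses = []
    for logits, target, is_masked in zip(frame_logits, cluster_ids, mask):
        if is_masked:
            # Negative log-probability assigned to the target cluster.
            losses.append(-log_softmax(logits)[target])
    if not losses:
        raise ValueError("mask selects no frames")
    return sum(losses) / len(losses)

# Toy example: 3 frames, K = 4 clusters (all values are made up).
frame_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction
    [2.0, 0.5, -1.0, 0.0],  # unmasked frame, ignored by the loss
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction
]
cluster_ids = [1, 0, 3]      # hypothetical k-means assignments per frame
mask = [True, False, True]   # hypothetical masking pattern

loss = masked_cross_entropy(frame_logits, cluster_ids, mask)
print(f"masked cross-entropy: {loss:.4f}")
```

With uniform logits over four clusters, each masked frame contributes log 4, so the toy loss equals log 4. In practice the logits would come from the model's output layer and the cluster IDs from k-means run on the input features.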
