MelHuBERT: A simplified HuBERT on Mel spectrograms

Tzu-Quan Lin, Hung-yi Lee, Hao Tang

TL;DR

This work tackles the high compute barrier of self-supervised speech learning by rethinking HuBERT's architecture and training. It replaces the waveform-based front-end with Mel spectrogram inputs, uses a simple cross-entropy loss over k-means centroids, and employs staged pre-training, including a second stage guided by phone-relevant targets. Empirically, MelHuBERT achieves competitive phone recognition, speaker identification, and ASR performance while reducing pre-training time by 31.2% and MACs per second by 33.5% on the 360-hour LibriSpeech subset; even stronger gains appear in low-resource settings. The paper also provides a detailed analysis of the learned representations, showing how MelHuBERT and HuBERT differ and where each has strengths, and releases code and models to enable broader access to efficient self-supervised speech learning.

Abstract

Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. To reduce the computation required for training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, the input representation, and training in multiple stages. Our model, MelHuBERT, achieves favorable performance against HuBERT on phone recognition, speaker identification, and automatic speech recognition, while saving 31.2% of the pre-training time, or equivalently 33.5% of the MACs per second of speech. The code and pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
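
As a rough illustration of the simplified loss, the TL;DR describes it as a cross-entropy over k-means centroids. The sketch below shows what such a loss could look like in plain Python. It is not taken from the released code: the function names, the toy shapes, and the choice to average only over masked frames are assumptions made for illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one frame's cluster scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def masked_cluster_cross_entropy(logits, targets, mask):
    """Mean cross-entropy between predicted cluster scores and k-means cluster IDs.

    logits:  list of T frames, each a list of K scores (one per cluster)
    targets: list of T integer cluster IDs, e.g. from k-means on acoustic features
    mask:    list of T booleans, True where the input frame was masked
             (assumption: only masked frames contribute to the loss)
    """
    losses = [
        -log_softmax(frame)[target]
        for frame, target, is_masked in zip(logits, targets, mask)
        if is_masked
    ]
    if not losses:
        raise ValueError("at least one frame must be masked")
    return sum(losses) / len(losses)


# Toy example: 4 frames, 4 clusters, uniform scores on every frame.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
cluster_ids = [2, 0, 3, 1]
frame_mask = [True, False, True, True]
loss = masked_cluster_cross_entropy(uniform_logits, cluster_ids, frame_mask)
print(loss)  # about 1.386, i.e. ln(4), since every cluster is equally likely
```

With uniform scores the loss equals the log of the number of clusters, which is a convenient sanity check for any implementation of this objective.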

Paper Structure (18 sections, 3 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: An overview of the HuBERT and MelHuBERT model architectures.
  • Figure 2: Top: CCA similarity between different layers and phones. Bottom: CCA similarity between different layers and log Mel spectrograms. C1 indicates the first convolution layer, T1 indicates the first Transformer layer, and feat is the input to T1. The models are pre-trained on the LibriSpeech 360-hour subset.
  • Figure 3: Pre-training time required to reach epochs 50, 100, 150, and 200 on the 100-hour subset of LibriSpeech, and the respective downstream ASR performance.