
Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining

Valentin Vielzeuf

TL;DR

The paper asks why the top layers of self-supervised speech models such as HuBERT retain input-like information, and how to mitigate this "autoencoder" behavior. It analyzes HuBERT's iterative pretraining and proposes three strategies that adjust the iteration structure and the clustering height. The experiments indicate that progressive scheduling and cluster dynamics can preserve higher-level representations, accelerate convergence, and improve downstream WER after finetuning on LibriLight-10h, as supported by layerwise and semantic-level probing. The key contributions are a detailed factorized analysis of top-layer representations, a progressive pretraining strategy that slightly improves performance, and a reduction in pretraining time of about 2x while maintaining or improving downstream results. These findings offer practical guidance on training efficiency and representation quality in self-supervised speech, and may apply to other architectures and to more spontaneous speech domains.

Abstract

Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.

Paper Structure

This paper contains 5 sections, 1 equation, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 2: Greedy word error rate evolution (finetuning is performed on LibriLight-10h every 80k pretraining steps) on LibriSpeech Dev-Clean for different pretraining procedures. Comparison with the official HuBERT is done at 650,000 minibatch steps.
  • Figure: Wav2vec2.0 - CCA-Word
  • Figure: CCA-Word
  • Figure: Wav2vec2.0 - CCA-Word
  • Figure: HuBERT First Iteration - CCA-Word
  • ...and 5 more figures
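
Figure 2 tracks word error rate (WER), conventionally defined as the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a generic illustration of that metric, not the paper's evaluation code: the function name `word_error_rate` is hypothetical, and whitespace tokenization without text normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch only: tokens are split on whitespace and no
    text normalization (casing, punctuation) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.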