Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining
Valentin Vielzeuf
TL;DR
The paper addresses why the top layers of self-supervised speech models such as HuBERT can retain input-like information, and how to mitigate this autoencoder behavior. It analyzes HuBERT’s iterative pretraining and proposes three strategies that adjust the iteration structure and clustering height. The results show that progressive scheduling and cluster dynamics can preserve higher-level representations, accelerate convergence, and improve downstream WER on LibriLight, as supported by layerwise and semantic-level probing. The key contributions are a detailed factorized analysis of top-layer representations, a progressive pretraining strategy that slightly improves performance, and a reduction in pretraining time of about 2x while maintaining or enhancing downstream results. These findings offer practical guidance on training efficiency and representation quality in self-supervised speech, with potential applicability to other architectures and to more spontaneous speech domains.
Abstract
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
