Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models

Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh

TL;DR

This work systematically analyzes how the design of prediction targets in HuBERT-style masked pre-training affects downstream speech tasks. By examining initial targets, clustering layers, and target granularity, it reveals suboptimalities in common setups and introduces Layer Multi-Target and RVQ-token prediction to produce more informative representations. The findings show that content-focused and information-rich targets improve performance across phoneme recognition, speaker identification, and speech separation, with pragmatic approaches to bypass exhaustive layer searches. Overall, the proposed strategies offer a principled path to more versatile speech foundation models with better cross-task transfer.

Abstract

Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework impacts their performance on downstream tasks. For instance, models pre-trained with targets that capture prosody learn representations suited for speaker-related tasks, while those pre-trained with targets that capture phonetics learn representations suited for content-related tasks. Moreover, prediction targets can differ in the level of detail they capture. Models pre-trained with targets that encode fine-grained acoustic features perform better on tasks like denoising, while those pre-trained with targets focused on higher-level abstractions are more effective for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores the design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
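The masked prediction objective described above can be illustrated with a minimal sketch. It assumes the prediction targets are discrete ids (for example, cluster indices) and that the loss is a cross-entropy averaged over masked frames only; the function name, array shapes, and toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only (illustrative sketch).

    logits:  (T, K) float array, one score per frame and target class
    targets: (T,)   int array, discrete target id per frame (e.g. a cluster index)
    mask:    (T,)   bool array, True where the input frame was masked
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the target class at every frame.
    nll = -log_probs[np.arange(len(targets)), targets]
    # Only masked frames contribute; unmasked frames are ignored in this sketch.
    return float(nll[mask].mean())


rng = np.random.default_rng(0)
T, K = 8, 4                      # hypothetical: 8 frames, 4 target classes
targets = rng.integers(0, K, size=T)
mask = np.zeros(T, dtype=bool)
mask[2:5] = True                 # hypothetical masked span

# A model that scores every class equally pays log(K) per masked frame.
uniform_loss = masked_prediction_loss(np.zeros((T, K)), targets, mask)
```

In this framing, changing the prediction targets only changes the `targets` array (and the number of classes `K`), which is why target design can be studied separately from the model architecture.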

Paper Structure (23 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Design decisions (marked in red) in the iterative clustering (HuBERT) procedure that affect prediction targets. Detailed descriptions of these decisions are provided in the method section.
  • Figure 2: Contribution of each transformer layer when predicting more informative targets, evaluated on phoneme recognition (PR, left), speaker identification (SID, middle), and speech separation (SS, right). Darker colors indicate higher contribution. The $0^{\text{th}}$ layer refers to the input to the transformer.