UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

TL;DR

UniWav introduces a unified encoder-decoder pre-training framework that jointly learns speech representations and a generative decoder, supporting both recognition and generation tasks. The encoder is trained by self-distillation with online clustering, and the decoder is a Flow Matching model conditioned on encoder features. UniWav is trained end-to-end from scratch and evaluated on speech recognition, in-context text-to-speech, and speech tokenization. Across LibriSpeech-based tasks, UniWav achieves results competitive with task-specific baselines and demonstrates strong low-bitrate tokenization quality, showing the viability of a single foundation model for both understanding and generation in speech. Analyses indicate that joint training yields representations carrying both linguistic and speaker/environment information, while encoder capacity significantly shapes discriminative performance, highlighting practical trade-offs and future directions for unified speech foundation models.

Abstract

Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are designed either for discriminative tasks or for generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and a generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training for representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves performance comparable to that of existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.

Paper Structure

This paper contains 36 sections, 11 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: An overview of UniWav. The encoder is trained with masked audio modeling and pseudo-labels obtained through a teacher model. The teacher model is the exponential moving average (EMA) of the encoder. The decoder is trained with Flow Matching conditioned on $z$, the weighted sum of representations from different encoder layers. All modules are trained jointly from scratch.
  • Figure 2: Mutual information between quantized representation and phone/speaker (left/right) at different layers. Results are computed on the dev set of LibriSpeech. For quantization, 1024 clusters are used for k-means.
  • Figure 3: Utterance 7176-88083-0014 from test-clean
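
The two mechanisms named in the Figure 1 caption, the EMA teacher and the weighted sum of encoder layers that forms $z$, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ema_update` and `weighted_layer_sum`, the decay value, and the choice of softmax-normalized layer weights are all assumptions made for illustration, and parameters are represented as plain Python lists rather than tensors.

```python
import math


def ema_update(teacher, student, decay=0.999):
    """Return teacher parameters moved toward the student.

    Each parameter follows t <- decay * t + (1 - decay) * s.
    The decay value is a hypothetical default, not taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def weighted_layer_sum(layer_outputs, layer_logits):
    """Combine per-layer feature vectors into a single vector z.

    layer_outputs: list of equal-length feature vectors, one per encoder layer.
    layer_logits: one scalar per layer; softmax normalization is an
    assumption here, the caption only says "weighted sum".
    """
    peak = max(layer_logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
        for i in range(dim)
    ]


# Toy usage: two scalar parameters, two encoder layers with 2-dim features.
teacher = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9)
z = weighted_layer_sum([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
```

With equal logits the layer weights are uniform, so `z` is the plain average of the layer outputs; in a trained model the logits would be learned so that the decoder can draw on whichever layers are most useful for generation.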