UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Alexander H. Liu; Sang-gil Lee; Chao-Han Huck Yang; Yuan Gong; Yu-Chiang Frank Wang; James R. Glass; Rafael Valle; Bryan Catanzaro

TL;DR

UniWav introduces a unified encoder-decoder pre-training framework that jointly learns a speech representation encoder and a generative decoder, supporting both recognition and generation tasks. The encoder is trained with self-distillation and online clustering, and the Flow Matching-based decoder is conditioned on encoder features; the whole model is trained end-to-end from scratch and evaluated on speech recognition, in-context text-to-speech, and speech tokenization. Across LibriSpeech-based tasks, UniWav achieves results competitive with task-specific baselines and demonstrates strong low-bitrate tokenization quality, showing the viability of a single foundation model for both understanding and generation in speech. Analyses indicate that joint training yields representations that carry both linguistic content and speaker/environment cues, while encoder capacity significantly shapes discriminative performance, highlighting practical trade-offs and future directions for unified speech foundation models.
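
To make the decoder objective mentioned above more concrete, the following is a minimal sketch of a conditional Flow Matching loss. It is not the authors' implementation: the summary only states that the decoder is Flow Matching-based and conditioned on encoder features. The optimal-transport interpolation path used here is the common choice from the Flow Matching literature and is an assumption, as are all names (`fm_training_pair`, `fm_loss`, `predict_velocity`, `cond`, `sigma_min`).

```python
import random

def fm_training_pair(x0, x1, t, sigma_min=1e-5):
    """Build one training pair on the optimal-transport path (assumed path choice).

    x0: noise sample, x1: target audio features, t: time in [0, 1].
    Returns the interpolated point x_t and the target velocity.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target

def fm_loss(predict_velocity, x1, cond, t, rng, sigma_min=1e-5):
    """Mean squared error between predicted and target velocity.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the decoder;
    cond stands in for the encoder features it is conditioned on.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    x_t, target = fm_training_pair(x0, x1, t, sigma_min)
    pred = predict_velocity(x_t, t, cond)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)

# Toy usage: an untrained "decoder" that always predicts zero velocity.
rng = random.Random(0)
features = [0.2, -0.1, 0.4]      # placeholder target frame
encoder_out = [1.0, 0.0, 0.5]    # placeholder conditioning vector
loss = fm_loss(lambda x_t, t, c: [0.0] * len(x_t), features, encoder_out, 0.5, rng)
```

In a real system the velocity predictor would be a neural network and `t` would be sampled per example; the sketch only shows how the regression target is formed.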

Abstract

Pre-training and representation learning play an increasingly important role in modern speech processing. Nevertheless, different applications rely on different foundation models, since predominant pre-training techniques are designed for either discriminative or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and a generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training for representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves performance comparable to that of existing foundation models, each trained for a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
