Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration

Pin-Jui Ku; Alexander H. Liu; Roman Korostik; Sung-Feng Huang; Szu-Wei Fu; Ante Jukić

TL;DR

The paper targets high-quality speech restoration by replacing vocoder-based synthesis with a flow-matching pretraining framework that operates directly on complex STFT coefficients. It introduces an STFT-based model (approximately 430M parameters) in which a transformer with adaptive normalization learns a vector field; an ODE solver integrates that field to generate STFT coefficients, which are then inverted to a time-domain waveform. The model is trained with a conditional flow matching objective, pretrained on Libri-Light, and finetuned on four tasks: denoising, bandwidth extension, codec artifact removal, and target speaker extraction. Finetuning outperforms strong baselines on all four tasks and removes the need for a separate vocoder. The code and pretrained checkpoints are publicly released in NVIDIA NeMo, positioning the model as a general foundation model for speech restoration.
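The training objective and ODE-based generation mentioned above can be illustrated with a minimal NumPy sketch. It assumes the optimal-transport conditional path commonly used in flow matching; the constant `SIGMA_MIN`, the function names, and the Euler solver with 8 steps are illustrative assumptions, not details taken from the paper or from NeMo.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small constant; not a value reported by the paper


def cfm_pair(x1, x0, t, sigma_min=SIGMA_MIN):
    """Build one training pair for conditional flow matching.

    x1: clean complex STFT coefficients (data sample)
    x0: complex Gaussian noise of the same shape
    t:  time in [0, 1]
    Returns the point x_t on the path and the target vector field u_t.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def cfm_loss(predicted_field, u_t):
    """Mean squared error between predicted and target vector fields."""
    return float(np.mean(np.abs(predicted_field - u_t) ** 2))


def euler_solve(vector_field, x0, num_steps=8):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * vector_field(x, k * dt)
    return x
```

In a real system `vector_field` would be the trained transformer, conditioned on the degraded input; here any callable with the signature `(x, t)` works, which keeps the sketch independent of a particular network.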

Abstract

This paper proposes a generative pretraining foundation model for high-quality speech restoration tasks. By directly operating on complex-valued short-time Fourier transform coefficients, our model does not rely on any vocoders for time-domain signal reconstruction. As a result, compared to prior work SpeechFlow, our model simplifies the synthesis process and removes the quality upper bound imposed by a mel-spectrogram vocoder. The proposed method is evaluated on multiple speech restoration tasks, including speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction. In all scenarios, finetuning our pretrained model results in superior performance over strong baselines. Notably, in the target speaker extraction task, our model outperforms existing systems, including those leveraging SSL-pretrained encoders like WavLM. The code and the pretrained checkpoints are publicly available in the NVIDIA NeMo framework.
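The vocoder-free claim rests on the complex STFT being invertible: once the coefficients are known, the waveform follows from an inverse transform rather than from a learned vocoder. The sketch below demonstrates that round trip in plain NumPy. The Hann window, `n_fft=512`, and `hop=128` are assumed values chosen for illustration and are not the paper's configuration.

```python
import numpy as np


def stft(signal, n_fft=512, hop=128):
    """Complex STFT with a Hann window; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    # Zero-pad both ends so every original sample is covered by full overlap.
    padded = np.pad(signal, n_fft)
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def istft(spec, length, n_fft=512, hop=128):
    """Invert stft() by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)
    # Drop the padding added in stft().
    return out[n_fft : n_fft + length]
```

A mel-spectrogram discards phase and compresses the frequency axis, so it has no such exact inverse; that is the gap a vocoder normally fills.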
