Generative Pre-training for Speech with Flow Matching

Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

TL;DR

This work pre-trains a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions, and suggests that a foundation model for speech generation tasks can be built with generative pre-training.

Abstract

Generative models have gained increasing attention in recent years for their remarkable success in tasks that require estimating and sampling from a data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoding are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-train a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundation model for speech generation tasks can be built with generative pre-training.
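To make "Flow Matching with masked conditions" concrete, the following is a minimal sketch of how one training example could be built. It uses the standard optimal-transport conditional flow matching path; the function names, the `sigma_min` value, the single contiguous mask span, and the `mask_ratio` default are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def masked_flow_matching_example(x1, rng, sigma_min=1e-4, mask_ratio=0.8):
    """Build one conditional flow matching training example (illustrative only).

    x1: clean feature sequence with shape (frames, dims).
    Returns the noisy input x_t, the flow step t, the masked condition,
    the target velocity u_t, and the boolean frame mask.
    """
    frames, _ = x1.shape
    t = rng.uniform(0.0, 1.0)              # flow step in [0, 1]
    x0 = rng.standard_normal(x1.shape)     # sample from the Gaussian prior
    # Optimal-transport path between the prior sample x0 and the data x1.
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0      # velocity the model should regress

    # Assumed masking scheme: zero out one contiguous span of frames.
    span = int(round(mask_ratio * frames))
    start = rng.integers(0, frames - span + 1)
    mask = np.zeros(frames, dtype=bool)
    mask[start:start + span] = True
    cond = np.where(mask[:, None], 0.0, x1)  # masked copy of x1 used as condition
    return x_t, t, cond, u_t, mask

def flow_matching_loss(pred_velocity, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - u_t) ** 2))
```

A model trained this way would take `x_t`, `t`, and `cond` as input and predict `u_t`; at fine-tuning time the masked condition would be swapped for a task-specific one such as a noisy recording.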

Paper Structure (52 sections, 6 equations, 4 figures, 12 tables)

Figures (4)

  • Figure 1: An overview of SpeechFlow. (Left) Pre-training with masked audio. (Right) Fine-tuning with task-specific condition such as noisy recording, overlapped speech, or phone sequence. More details of the model and conditioning are available in Section \ref{subsec:model}.
  • Figure 2: Blue blocks are learnable weights. (Left) Model architecture. Time (flow step) $t$ is encoded using a sinusoidal position embedding with a learnable scale. (Right) Conditions used for different tasks. For TTS fine-tuning, a learnable phone embedding sequence aligned to the spectrogram is added elementwise to the masked spectrogram. Since phone embeddings are randomly initialized and added to the masked spectrogram, we found that ramping up a zero-initialized gating value (a single scalar multiplied with the phone embedding) yields slightly better results in practice.
  • Figure 3: Impact of different pre-training hyper-parameters on zero-shot speaker adaptation (ZSSA) TTS and enhancement. The dashed line stands for the baseline performance without pre-training.
  • Figure 10: Additional results of the English zero-shot speaker adaptation TTS experiment with different pre-training hyper-parameters. To reduce computation, models in this figure are only pre-trained for 300k steps.
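The two conditioning mechanisms named in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions: the function names, the `max_period` constant, and the choice to apply the learnable scale to the embedding output are hypothetical and not confirmed by the caption; `scale` and `gate` stand in for parameters that would be learned during training.

```python
import numpy as np

def sinusoidal_embedding(t, dim, scale=1.0, max_period=10000.0):
    """Sinusoidal embedding of a scalar flow step t.

    `scale` stands in for the learnable scale; applying it to the
    output (rather than to t) is an assumption of this sketch.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return scale * np.concatenate([np.sin(angles), np.cos(angles)])

def tts_condition(masked_spec, phone_emb, gate=0.0):
    """Add gated, frame-aligned phone embeddings to the masked spectrogram.

    With gate = 0 (its initial value) the condition is just the masked
    spectrogram, so randomly initialized embeddings have no effect at first.
    """
    return masked_spec + gate * phone_emb
```

Starting the gate at zero means fine-tuning begins from exactly the input the pre-trained model already expects, and the phone information is blended in gradually as the gate grows.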