VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Yiwei Guo; Chenpeng Du; Ziyang Ma; Xie Chen; Kai Yu

TL;DR

VoiceFlow is an acoustic model that uses a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. Subjective and objective evaluations showed that its synthesis quality is superior to that of its diffusion counterpart.

Abstract

Although diffusion models have become a popular choice in text-to-speech due to their strong generative ability, the intrinsic complexity of sampling from them harms their efficiency. As an alternative, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the generation of mel-spectrograms as an ordinary differential equation conditioned on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single- and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow.
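
To illustrate why a straightened sampling trajectory allows fewer sampling steps, the following is a minimal toy sketch and not the VoiceFlow implementation. It assumes a fixed-step Euler solver integrating from t = 0 to t = 1, omits text conditioning, and uses 2-D vectors in place of mel-spectrograms. The names `euler_sample`, `straight_field`, and `curved_field` are hypothetical; in a real system the vector field would be a trained neural network.

```python
import numpy as np

def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x

# Hypothetical 2-D stand-ins for a noise sample and its target output.
noise = np.array([1.0, 0.0])
target = np.array([0.0, 1.0])

def straight_field(x, t):
    # Idealized straightened field: constant velocity along a straight line.
    return target - noise

ROTATE_90 = np.array([[0.0, -1.0], [1.0, 0.0]])

def curved_field(x, t):
    # Toy curved field: rotates x by 90 degrees over t in [0, 1],
    # so it also carries `noise` to `target`, but along an arc.
    return (np.pi / 2) * (ROTATE_90 @ x)

def error(field, num_steps):
    # Distance between the Euler result and the true endpoint.
    return float(np.linalg.norm(euler_sample(field, noise, num_steps) - target))

straight_err_1 = error(straight_field, 1)    # one step along a straight path
curved_err_1 = error(curved_field, 1)        # one step along a curved path
curved_err_100 = error(curved_field, 100)    # many steps along a curved path

print(straight_err_1, curved_err_1, curved_err_100)
```

In this toy setting, a single Euler step is exact for the straight field, while the curved field needs many steps to approach the same endpoint. This is the intuition behind straightening the trajectory; it says nothing about how well a learned field is straightened in practice.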

Paper Structure (14 sections, 6 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Flow Matching and Rectified Flow
  Flow Matching Generative Models
  Improved Sampling Efficiency with Rectified Flow
VoiceFlow
  Flow Matching-Based Acoustic Model
  Sampling and Flow Rectification Step
Experiments and Results
  Experimental Setup
  Subjective Evaluations
  Objective Evaluations
  Ablation Study
Conclusion
Acknowledgement

Figures (4)

Figure 1: Working diagram of the VoiceFlow model
Figure 2: MOSNet evaluations across different numbers of sampling steps
Figure 3: MCD evaluations across different numbers of sampling steps
Figure 4: Visualization of sampling trajectories
