ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech

Wenhao Guan; Qi Su; Haodong Zhou; Shiyu Miao; Xingjia Xie; Lin Li; Qingyang Hong

TL;DR

Diffusion-based TTS often suffers from slow sampling because it needs many iterations. ReFlow-TTS introduces a rectified-flow ODE that transports a standard Gaussian $\pi_0$ to the Mel-spectrogram distribution $\pi_1$ along near-straight paths, enabling high-fidelity synthesis with a single sampling step and without teacher pretraining. The model uses a simple unconstrained least-squares objective and an RK45 ODE solver for inference, with an optional 2-ReFlow-TTS variant to further improve efficiency. On LJSpeech, ReFlow-TTS achieves competitive or state-of-the-art results among diffusion-based TTS models while significantly reducing the number of sampling steps, highlighting its practical potential for real-world speech synthesis.
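For orientation, the ODE and the least-squares objective mentioned above can be written in the standard rectified-flow form. This is a sketch of the general formulation, not a transcription of the paper's own equations: the symbols $v$, $Z_t$ and $X_t$ follow the usual rectified-flow notation and are assumptions here, and any conditioning of $v$ on the text input is omitted.

$$
\mathrm{d}Z_t = v(Z_t, t)\,\mathrm{d}t, \qquad Z_0 \sim \pi_0, \quad t \in [0, 1]
$$

$$
\min_{v} \int_0^1 \mathbb{E}\left[\left\lVert (X_1 - X_0) - v(X_t, t) \right\rVert^2\right] \mathrm{d}t, \qquad X_t = t X_1 + (1 - t) X_0
$$

Here $X_0 \sim \pi_0$ and $X_1 \sim \pi_1$. If the learned paths are exactly straight, a single Euler step $Z_1 = Z_0 + v(Z_0, 0)$ already reaches the end point, which is what makes one-step sampling plausible.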

Abstract

Diffusion models, including Denoising Diffusion Probabilistic Models (DDPM) and score-based generative models, have demonstrated excellent performance in speech synthesis tasks. However, their effectiveness comes at the cost of numerous sampling steps, which prolongs the sampling time required to synthesize high-quality speech. This drawback hinders their practical applicability in real-world scenarios. In this paper, we introduce ReFlow-TTS, a novel rectified-flow-based method for high-fidelity speech synthesis. Specifically, ReFlow-TTS is simply an Ordinary Differential Equation (ODE) model that transports a Gaussian distribution to the ground-truth Mel-spectrogram distribution along paths that are as straight as possible. Furthermore, our proposed approach enables high-quality speech synthesis with a single sampling step and eliminates the need to train a teacher model. Our experiments on the LJSpeech dataset show that ReFlow-TTS achieves the best performance among the compared diffusion-based models, and that ReFlow-TTS with one-step sampling achieves competitive performance compared with existing one-step TTS models.
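The link between straight paths and one-step sampling can be checked numerically on a toy problem. The sketch below is illustrative only: it uses NumPy, a hypothetical affine coupling `X_1 = A * X_0 + B` with a hand-written velocity field `toy_velocity` in place of the paper's trained network, and fixed-step Euler integration rather than an RK45 solver. All function and variable names are assumptions introduced for this example.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Linear interpolation X_t = t * X_1 + (1 - t) * X_0."""
    return t * x1 + (1.0 - t) * x0


def rectified_flow_loss(velocity_fn, x0, x1, t):
    """Monte Carlo estimate of E || (X_1 - X_0) - v(X_t, t) ||^2."""
    xt = interpolate(x0, x1, t)
    residual = (x1 - x0) - velocity_fn(xt, t)
    return float(np.mean(np.sum(residual ** 2, axis=-1)))


def euler_sample(velocity_fn, z0, num_steps):
    """Integrate dZ_t = v(Z_t, t) dt from t = 0 to t = 1 with fixed-step Euler."""
    z = np.array(z0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z


# Hypothetical toy coupling whose interpolation paths are already straight
# and never cross: X_1 = A * X_0 + B (not the paper's data or model).
A, B = 0.5, 3.0


def toy_velocity(x, t):
    """Exact velocity field for the toy coupling above."""
    # Recover the starting point x0 from X_t = x0 * (1 + t * (A - 1)) + t * B.
    x0 = (x - t * B) / (1.0 + t * (A - 1.0))
    # Along each path the velocity X_1 - X_0 is constant in t.
    return (A - 1.0) * x0 + B


rng = np.random.default_rng(0)
x0 = rng.standard_normal((1024, 2))   # samples standing in for pi_0
x1 = A * x0 + B                       # paired samples standing in for pi_1
t = rng.uniform(size=(1024, 1))       # one random time per pair

loss = rectified_flow_loss(toy_velocity, x0, x1, t)
one_step = euler_sample(toy_velocity, x0, num_steps=1)
many_steps = euler_sample(toy_velocity, x0, num_steps=50)
```

In this toy setting the loss is zero up to floating-point error, and `one_step` and `many_steps` both land on `x1`, because the velocity is constant along each straight path. A trained model only approximates this, which is why the number of solver steps still matters in practice.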

Paper Structure (12 sections, 4 equations, 5 figures, 3 tables)


Introduction
Rectified Flow Model
ReFlow-TTS
Rectified Flow Model for TTS
Model Architecture
Experiments
Experimental Setup
Dataset
Evaluation Metrics
Comparative Models
Audio Performance
Conclusions

Figures (5)

Figure 1: The graphical model for the Rectified Flow Model. (a) Linear interpolation of data samples $(X_{0},X_{1})$. (b) The rectified flow $Z_{t}$ induced by $(X_{0},X_{1})$. (c) Linear interpolation of the samples $(Z_{0},Z_{1})$ of the rectified flow $Z_{t}$. (d) The rectified flow induced by $(Z_{0},Z_{1})$, which follows straight paths.
Figure 2: An illustration of ReFlow-TTS.
Figure 3: The visualization results of Mel-spectrograms for compared models.
Figure 4: The visualization results of Mel-spectrograms for one step sampling TTS models.
Figure 5: The visualization results of Mel-spectrograms for ReFlow-TTS and 2-ReFlow-TTS.
