ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech
Wenhao Guan, Qi Su, Haodong Zhou, Shiyu Miao, Xingjia Xie, Lin Li, Qingyang Hong
TL;DR
Diffusion-based TTS is often slow at inference because sampling requires many iterative steps. ReFlow-TTS introduces a rectified-flow ODE that transports a standard Gaussian $\pi_0$ to the Mel-spectrogram distribution $\pi_1$ along near-straight paths, enabling high-fidelity synthesis with a single sampling step and without teacher pretraining. The model uses a simple unconstrained least-squares objective and an RK45 ODE solver for inference, with an optional 2-ReFlow-TTS variant to further improve efficiency. On LJSpeech, ReFlow-TTS achieves competitive or state-of-the-art results among diffusion-based TTS models while significantly reducing the number of sampling steps, highlighting its practical potential for real-world speech synthesis.
Abstract
Diffusion models, including Denoising Diffusion Probabilistic Models (DDPM) and score-based generative models, have demonstrated excellent performance in speech synthesis tasks. However, their effectiveness comes at the cost of numerous sampling steps, which prolongs the time required to synthesize high-quality speech. This drawback hinders their practical applicability in real-world scenarios. In this paper, we introduce ReFlow-TTS, a novel rectified-flow-based method for high-fidelity speech synthesis. Specifically, ReFlow-TTS is simply an Ordinary Differential Equation (ODE) model that transports a Gaussian distribution to the ground-truth Mel-spectrogram distribution along paths that are as straight as possible. Furthermore, the proposed approach enables high-quality speech synthesis with a single sampling step and eliminates the need to train a teacher model. Our experiments on the LJSpeech dataset show that ReFlow-TTS achieves the best performance among the diffusion-based models compared. With one-step sampling, ReFlow-TTS achieves competitive performance compared with existing one-step TTS models.
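To make the straight-path idea concrete, the following is a minimal pure-Python sketch of the generic rectified-flow recipe, not the authors' implementation. It assumes the standard formulation: a noise sample $X_0 \sim \pi_0$ and a data sample $X_1 \sim \pi_1$ are joined by the line $X_t = t X_1 + (1 - t) X_0$, and a velocity model $v(X_t, t)$ is regressed onto the direction $X_1 - X_0$ by least squares. The names `interpolate`, `rectified_flow_loss`, `euler_sample`, `SHIFT`, and `toy_velocity` are hypothetical, the text conditioning used in TTS is omitted, and a fixed-step Euler integrator stands in for the RK45 solver mentioned above.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 (noise) to x1 (data)."""
    return [t * b + (1.0 - t) * a for a, b in zip(x0, x1)]

def rectified_flow_loss(velocity, x0, x1, t):
    """Squared error between the predicted velocity and the direction x1 - x0."""
    xt = interpolate(x0, x1, t)
    pred = velocity(xt, t)
    return sum(((b - a) - p) ** 2 for a, b, p in zip(x0, x1, pred))

def euler_sample(velocity, x0, num_steps=1):
    """Integrate dZ/dt = v(Z, t) from t=0 to t=1 with fixed-step Euler."""
    z = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity(z, i * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

# Toy setup (an assumption for illustration): the "data" is the noise shifted
# by a constant vector, so the ideal velocity field is that constant.
SHIFT = [2.0, -1.0, 0.5]

def toy_velocity(z, t):
    """Hypothetical, already-perfect velocity model for the toy coupling."""
    return list(SHIFT)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in SHIFT]
data = [a + s for a, s in zip(noise, SHIFT)]

loss = rectified_flow_loss(toy_velocity, noise, data, t=rng.random())
one_step = euler_sample(toy_velocity, noise, num_steps=1)
many_steps = euler_sample(toy_velocity, noise, num_steps=50)
```

In this toy case the trajectories are exactly straight, so a single Euler step lands on the same endpoint as 50 steps. That is the property the one-step sampling claim relies on; for a learned model on real Mel-spectrograms the paths are only approximately straight, so the agreement would be approximate as well.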
