F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang
TL;DR
F5R-TTS addresses the challenge of integrating reinforcement learning with non-autoregressive (NAR) flow-matching TTS by reformulating the deterministic flow outputs as probabilistic Gaussian distributions. In a second training phase, it applies GRPO with WER and SIM rewards, improving semantic accuracy and speaker similarity in zero-shot voice cloning. Experimental results on Mandarin data and internal datasets show a substantial reduction in WER (around 29.5% relative) and consistent SIM gains (about 4–6%), demonstrating the practical viability of RL-based fine-tuning for NAR TTS. The work follows a two-phase training paradigm (pretraining, then RL) and shows the benefit of reward-driven policy optimization for flow-matching TTS architectures.
Abstract
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching-based architecture. By reformulating the deterministic outputs of flow-matching TTS as probabilistic Gaussian distributions, our approach enables the integration of reinforcement learning algorithms. In the pretraining phase, we train a probabilistically reformulated flow-matching model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we apply a GRPO-driven enhancement stage that uses two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by speaker verification models. Experimental results on zero-shot voice cloning show that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative reduction in WER) and speaker similarity (a 4.6% relative increase in SIM score) compared to conventional flow-matching-based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
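
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reformulated model predicts a per-dimension mean and standard deviation, so that a sampled output has a tractable Gaussian log-probability, and it uses the standard GRPO group-relative advantage (reward minus group mean, divided by group standard deviation). The function names, the reward form `(1 - WER) + SIM`, and the weights `w_wer` and `w_sim` are illustrative assumptions; the abstract does not specify how the two rewards are combined.

```python
import math
from statistics import mean, pstdev


def gaussian_log_prob(x, mu, sigma):
    """Log-density of x under independent Gaussians N(mu_i, sigma_i^2).

    This is the quantity a policy-gradient method needs once a
    deterministic output is replaced by a predicted mean and std.
    """
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (xi - m) ** 2 / (2 * s * s)
        for xi, m, s in zip(x, mu, sigma)
    )


def combined_reward(wer, sim, w_wer=1.0, w_sim=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM both raise it."""
    return w_wer * (1.0 - wer) + w_sim * sim


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sd = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sd + eps) for r in rewards]


# Illustrative (made-up) WER/SIM scores for four samples of one prompt.
group = [(0.10, 0.70), (0.02, 0.75), (0.30, 0.60), (0.05, 0.72)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)
```

Samples that score better than their group average receive positive advantages and have their log-probability pushed up; samples below average are pushed down. Because the baseline comes from the group itself, no separate value network is needed.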
