F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization

Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang

TL;DR

F5R-TTS addresses the challenge of integrating reinforcement learning with non-autoregressive flow-matching TTS by reformulating deterministic flow outputs into probabilistic Gaussian distributions. It then applies GRPO with WER and SIM rewards in a second phase, significantly improving semantic accuracy and speaker similarity in zero-shot voice cloning. Experimental results on Mandarin data and internal datasets show substantial reductions in WER (around 29.5% relative) and consistent SIM gains (about 4–6%), demonstrating the practical viability of RL-based fine-tuning for NAR TTS. The work highlights a two-phase training paradigm and demonstrates the benefits of reward-driven policy optimization for flow-matching TTS architectures.

Abstract

We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a 29.5% relative reduction in WER) and speaker similarity (a 4.6% relative increase in SIM score) compared to conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R.
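
The two ideas in the abstract (a Gaussian output that assigns a likelihood to each sample, and a group-relative advantage computed from WER and SIM rewards) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The diagonal-Gaussian form, the reward combination `w_wer * (1 - WER) + w_sim * SIM`, the weights, and all function names and sample values are assumptions made for illustration; only the standard GRPO normalization (reward minus group mean, divided by group standard deviation) is taken as given.

```python
import math
from statistics import mean, pstdev


def gaussian_log_prob(x, mu, sigma):
    """Log-density of x under an independent Gaussian N(mu, sigma^2) per dimension.

    Assumption: the model's final layer emits a mean and a standard deviation
    for each output dimension, so a sampled output has a well-defined likelihood.
    """
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mu, sigma)
    )


def combined_reward(wer, sim, w_wer=1.0, w_sim=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM both raise it."""
    return w_wer * (1.0 - wer) + w_sim * sim


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# One group of four sampled utterances for a single prompt.
# The (WER, SIM) pairs are made-up values, not results from the paper.
group = [(0.10, 0.70), (0.02, 0.75), (0.30, 0.60), (0.05, 0.72)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)

for (wer, sim), adv in zip(group, advantages):
    print(f"WER={wer:.2f} SIM={sim:.2f} advantage={adv:+.3f}")
```

In this sketch the sample with the lowest WER and highest SIM receives the largest positive advantage, so a policy-gradient step weighted by `advantages` and the per-sample log-probabilities would push the model toward that sample and away from the worst one.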

Paper Structure

This paper contains 11 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: We conducted zero-shot voice cloning experiments comparing three distinct models across different datasets. The evaluation was performed from two key perspectives: speaker similarity (measured by SIM) and semantic accuracy (measured by WER). Higher SIM and lower WER indicate superior performance.
  • Figure 2: The backbone of F5R-TTS, which is derived from a flow-matching based TTS model. The most significant difference in our model is that the final linear layer is modified to predict a probability distribution for each flow step.
  • Figure 3: The pipeline of the GRPO phase. We employ an ASR model and a speaker encoder to derive rewards, which are subsequently used to optimize the policy model. KL divergence is incorporated as a penalty term to enhance training stability during the GRPO phase.
  • Figure 4: t-SNE visualization of speaker similarity. From left to right, the three columns correspond to F5, F5-P, and F5-R, respectively. Each small number in the graph is an utterance sample. Different numbers or colors correspond to different target speakers. Numbers with an asterisk mark reference utterances of the corresponding target speaker; numbers without an asterisk mark synthesized utterances. Some bad cases are marked with red arrows.
  • Figure 5: The global variance (GV) of ground-truth speaker utterances and synthesized utterances from different models. In each subgraph, the horizontal axis represents the mel bin index and the vertical axis represents the variance. Each subgraph contains four GV curves from different sources, identified in the legend, where gt denotes ground truth.
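
The global variance curves described for Figure 5 can be understood through a small Python sketch that computes, for each mel bin, the variance of that bin over all frames of an utterance. This is only an illustration under stated assumptions: the function name `global_variance`, the use of population variance, and the toy spectrogram are hypothetical, and the paper may normalize or aggregate differently.

```python
from statistics import pvariance


def global_variance(mel):
    """Per-bin variance over time.

    mel: list of frames, each frame a list of mel-bin values (same length).
    Returns one variance value per mel bin (population variance, an assumption).
    """
    n_bins = len(mel[0])
    return [pvariance([frame[b] for frame in mel]) for b in range(n_bins)]


# Toy spectrogram with 3 frames and 3 mel bins (made-up values).
mel = [
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0],
    [4.0, 1.0, 4.0],
]
gv = global_variance(mel)
print(gv)
```

Plotting `gv` against the bin index for a ground-truth utterance and for each model's output gives curves of the kind the caption describes. A synthesized curve that sits well below the ground-truth curve is commonly read as a sign of over-smoothed spectra.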