PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion

Vikentii Pankov, Artem Gribul, Oktai Tatanov, Vladislav Proskurov, Yuliya Korotkova, Darima Mylzenova, Dmitrii Vypirailenko

TL;DR

PFluxTTS is a dual-decoder flow-matching TTS system that blends explicit duration control with alignment-free fluency via inference-time vector-field fusion. It adds FLUX-based voice cloning with speech-prompt embeddings to preserve speaker identity across languages, and uses a modified PeriodWave vocoder with super-resolution to 48 kHz. In cross-lingual, in-the-wild evaluations, it outperforms major open baselines in intelligibility and matches or surpasses commercial systems in speaker similarity, while remaining robust with short reference prompts and no extra training. The work shows that combining dual-path decoding, prompt-conditioned cloning, and high-fidelity vocoding yields practical, dubbing-friendly multilingual TTS.

Abstract

We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving a 23% relative reduction in WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/

Paper Structure

This paper contains 10 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Architecture of PFluxTTS. The Duration-Guided (DG) and Alignment-Free (AF) models are mixed with schedule $\alpha(t)$ during inference. The Speech Prompt Encoder, shown on the right, outputs either an embedding sequence for the DG model or a fixed embedding for the AF model.
  • Figure 2: Effect of inference-time model fusion on the intelligibility of PFluxTTS (CER as a function of the mixing coefficient $\alpha$).
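
The inference-time mixing mentioned in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the fusion is a convex combination of the two models' predicted vector fields, that $\alpha(t)$ weights the duration-guided field (the opposite convention is equally plausible), and that sampling uses plain Euler steps. All function names, the linear schedule, and the toy constant fields are hypothetical and chosen only to keep the example self-contained.

```python
def linear_schedule(t, alpha_start=1.0, alpha_end=0.0):
    """Hypothetical mixing schedule alpha(t) for t in [0, 1] (assumption, not from the paper)."""
    return alpha_start + (alpha_end - alpha_start) * t


def fuse_fields(v_dg, v_af, alpha):
    """Convex combination of two vector fields, element by element.

    Assumption: alpha weights the duration-guided (DG) field and
    (1 - alpha) weights the alignment-free (AF) field.
    """
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(v_dg, v_af)]


def euler_sample(x0, field_dg, field_af, schedule, n_steps=10):
    """Integrate dx/dt = fused field from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        alpha = schedule(t)
        v = fuse_fields(field_dg(x, t), field_af(x, t), alpha)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins for the two decoders: constant fields, so the result is easy to check by hand.
def toy_dg(x, t):
    return [1.0 for _ in x]


def toy_af(x, t):
    return [3.0 for _ in x]


x_const = euler_sample([0.0, 0.0], toy_dg, toy_af, lambda t: 0.5)
x_sched = euler_sample([0.0, 0.0], toy_dg, toy_af, linear_schedule)
print(x_const, x_sched)
```

With a constant $\alpha = 0.5$ the toy sample moves at the average of the two fields; with the linear schedule the early steps follow the DG field and the later steps follow the AF field. In a real system the two toy functions would be replaced by the trained decoders' vector-field predictions on mel features.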